Protein Analysis Workshop 2006
Secondary Structure Prediction
Bioinformatics group
Institute of Biotechnology
University of Helsinki
Alain Schenkel
Chris Wilton
Overview
Review of protein structure.
Introduction to structure prediction:
• Different approaches.
• Prediction of 1D strings of structural elements.
Server/soft review:
• COILS, MPEx, …
• The PredictProtein metaserver.
Proteins


Proteins play a crucial role in virtually all biological
processes with a broad range of functions.
The activity of an enzyme or the function of a
protein is governed by the three-dimensional
structure.
H11_MOUSE
histocompatibility antigen
VE2_BPV1
Bovine DNA-binding domain
20 amino acids - the building blocks
Clickable map at: http://www.russell.embl-heidelberg.de/aas/
The Amino Acids - hydrophobic
The Amino Acids - polar
The Amino Acids - charged
Secondary Structure: α-helix
Alpha-helix: 4_13
Very seldom: 3_10, 5_16 (Pi-helix)
Secondary Structure: α-helix

• 3.6 residues per turn
• Axial dipole moment
• Hydrogen-bonded
• Protein surfaces
• Typically, neither Proline nor Glycine (“helix-breakers”)
Secondary Structure: β-sheets
• Parallel or antiparallel
• Alternating side-chains
• Connecting loops often have polar amino acids
Terminology

Primary structure:
The sequence of amino acid residues
FTPAVHAFLDKFLAS …
Terminology

Secondary structure:
• A first level of structural organization.
• Provides rigidity.
• The structural form adopted by each amino acid residue:
  • H: helix (alpha)
  • E: extended (beta strand)
  • T: turn (often Proline)
  • C: coil (random, unstructured)

Terminology
Secondary structure elements (SSE):
• Stretches of residues in H conformation are helical SSEs.
• Stretches of residues in E conformation are beta-strand SSEs.
• Stretches of residues in C conformation are loops or coil.
• Turns (T) are isolated residues, usually Proline or Glycine.
• Other notation (in 3 states): L for all but H, E.
Secondary Structure Elements

Example:
one helix, one beta strand, three loops
Primary:
MSEGEDDFPRKRTPWCFDDEHMC
Secondary: CCHHHHHHCCCCEEEEEECCCCC
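The example above can be split into its secondary structure elements mechanically. The following is a minimal sketch; the function name `secondary_structure_elements` is made up for illustration and is not part of any server or package mentioned in these slides.

```python
from itertools import groupby


def secondary_structure_elements(ss):
    """Split a per-residue state string into runs.

    Returns a list of (state, start, end) tuples with 1-based,
    inclusive residue positions.
    """
    elements = []
    pos = 1
    for state, run in groupby(ss):
        length = len(list(run))
        elements.append((state, pos, pos + length - 1))
        pos += length
    return elements


# Strings taken from the slide example.
primary = "MSEGEDDFPRKRTPWCFDDEHMC"
secondary = "CCHHHHHHCCCCEEEEEECCCCC"

sses = secondary_structure_elements(secondary)
for state, start, end in sses:
    # Print each element with the residues it covers.
    print(state, start, end, primary[start - 1:end])
```

Counting the runs gives one H element, one E element and three C elements, which matches "one helix, one beta strand, three loops".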
Terminology
Tertiary structure:
• The full 3D structure of a single
polypeptide chain.
• Secondary structure elements pack
together to form a structural core.
• Called a protein “fold”.
Terminology
Quaternary structure:
• How several fully folded protein chains
pack together to form a fully functional
protein.
• Example: 1jch (ribosome inhibitor).
PDB identifier
The Protein Data Bank is the principal
repository for solved structures.
Example: 1jch has 4 chains
The elongated 2-helix structures in the center are called coiled-coils.
Structural classification of folds
For example (CATH):
• alpha
• beta
• alpha+beta
• alpha/beta
• irregular
More on structural classification next week.
Biochemical classification of folds

Globular proteins:
• in aqueous environment,
• compact fold,
• hydrophobic core and polar surfaces.

Membrane proteins:
• attached to or across the cell membrane,
• hydrophobic surface within membrane.

Fibrous proteins:
• structural role,
• repeat of regular/atypical SSE or irregular structure.
Globular
(2 domains)
Transmembrane
Fibrous
INTRODUCTION TO
STRUCTURE PREDICTION
Why is 3D Structure Important?

A pre-requisite for understanding function
• processes of molecular recognition,
• eg DNA recognition by 2bop.

Catalytic mechanisms of enzymes
• often require key residues to be close together in 3D
space.

Structure is often preserved under evolution
when sequence is not.

Drug design.
Structure Prediction
GPSRYIVDL… ?
Approaches to structure prediction



• Ab initio: from physical principles only.
• De novo: knowledge-based potentials from PDB.
• Fold recognition: thread sequence through known structures for compatibility.
• Homology modeling: use sequence alignment to infer possible template structure.
More on homology modeling next week.
Prediction in One-Dimension
Simplification: project 3D structure onto strings
of structural assignments. Eg:
• coiled-coils
• membrane helices
• solvent accessibility: residue is buried or exposed
…eeebbbbeebbbbee…
• secondary structure elements:
…HHHLLLEEEEEELLEEE…
If accurate: can be used to improve predictions
of 3D structures (eg, in fold recognition).
A Flow Chart for Structure Prediction
http://speedy.embl-heidelberg.de/gtsp/flowchart2.html
Structure Prediction
Why is structure prediction, and in particular ab initio prediction, a difficult problem?
• Many degrees of freedom: atoms of all residues and solvent.
• The number of possible conformations grows exponentially with the number of residues.
• Remote noncovalent interactions complicate matters.
• A delicate problem of stability.
• Cannot exhaustively search all possible conformations.
A folding protein does not try all conformations!
(Levinthal paradox)
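The arithmetic behind the Levinthal paradox can be sketched in a few lines. The numbers used here (3 conformations per residue, 10^13 conformations sampled per second, a 100-residue chain) are illustrative assumptions, not values from the slides.

```python
def levinthal_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Estimate the size of the conformational space and the time
    an exhaustive search would take.

    All default parameters are illustrative assumptions.
    """
    conformations = states_per_residue ** n_residues  # exact integer
    seconds = conformations / samples_per_second
    years = seconds / (3600 * 24 * 365.25)
    return conformations, years


conformations, years = levinthal_estimate(100)
print(f"{conformations:.2e} conformations, about {years:.1e} years")
```

Under these assumptions the search time comes out on the order of 10^27 years, which is why a folding protein cannot be sampling conformations at random.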
Basic Principle of Folding (globular protein)
Pack hydrophobic side chains into the interior of the molecule, away from solvent. So,
• Hydrophobic residues predominantly within a central structural core. Tight packing (crystal-like).
• Hydrophilic residues predominantly on the protein surface, exposed to solvent.
But main chain is highly polar. This forces the formation of SSEs in the core. So,
• Core residues tend to be in SSEs.
• Loops are on the outside of the protein.
Protein Structure and Evolution

Rate of evolution of genomic DNA
sequence reflects degree of functional
constraint.

Protein coding regions evolve much more
slowly than non-coding regions:
• need to maintain stable 3D protein structure,
• need to maintain vital biological function.
Rates of Protein Sequence Evolution

Sequences of highly constrained structures
evolve very slowly (eg: histones).

Less constrained ones evolve more quickly (eg:
immunoglobulins).

In general: the response to mutation is structural change, but many mutations change the structure only slightly or not at all
=> Structure is better conserved than sequence.
Evolution of SSEs and Loops

Residues in the hydrophobic core (SSEs) are constrained by the need for tight packing:
• changes rarely accepted - evolution is slow.
Residues on the surface (loops) are less constrained (simply need to be hydrophilic):
• aa substitution less restricted – evolution is quicker.
Evolution of Key Residues

Residues with key functional roles will be conserved.
• Eg: active site residues involved in catalysis.
• BUT: gene duplication can lead to change of function without changing structure.
Residues with key structural role also tend to be conserved. Eg:
• GLY: high conformational flexibility => tight turns,…
• PRO: side chain bonds back to the backbone => tight turns.
• CYS: disulfide bridges.
Structure Prediction by Homology
Multiple sequence / structure alignments measure differences in evolutionary rates of residues, and thus
• Contain more information than a single sequence for applications such as homology modeling and secondary structure prediction,
• Give location of conserved regions and motifs, residues buried in the protein core or exposed to solvent, plus important secondary structures.
More on homology modeling next week.
Secondary Structure Prediction
Three generations:

Single residue statistical analysis:
• For each amino acid type, assign its ‘propensity’ to be in a helix, sheet, or coil.
• Limited accuracy: ~55-60% on average.
• Eg: Chou-Fasman (1974), not used any more.
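As an illustration of what a per-residue ‘propensity’ is, here is a minimal sketch that estimates it from sequences with known secondary structure. It assumes the usual definition (how often an amino acid type is found in a state, relative to how often any residue is found in that state); the function name is hypothetical, and the single training pair is the toy example from the Terminology slides, far too small for real statistics.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Propensity of each amino acid type for one state (e.g. 'H').

    propensity = P(state | amino acid) / P(state), so values above 1
    mean the amino acid is over-represented in that state.
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        raise ValueError("state never observed in the training data")
    overall = n_state / sum(aa_total.values())
    return {aa: (aa_in_state[aa] / aa_total[aa]) / overall for aa in aa_total}


# Toy training set: the example sequence used earlier in the slides.
helix = propensities(["MSEGEDDFPRKRTPWCFDDEHMC"],
                     ["CCHHHHHHCCCCEEEEEECCCCC"], "H")
for aa, value in sorted(helix.items(), key=lambda item: -item[1]):
    print(aa, round(value, 2))
```

A first-generation predictor then assigns states from such per-residue values alone, which is one way to see why its accuracy stays limited: no information from neighbouring residues is used.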
Secondary Structure Prediction

Segment-based statistics:
• Look for correlations (within 11-21 aa windows).
• Many algorithms have been tried.
• Best performing: Neural Networks:
  • Input: a number of protein sequences with their known secondary structure.
  • Output: a trained network that predicts secondary structure elements for given query sequences.
• Accuracy < 70%.
• Eg: GORII, COMBINE.
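The window idea can be made concrete with a sketch of how a sequence might be turned into network inputs: one feature vector per residue, built from the residues in a window around it. The one-hot encoding, the extra padding slot and the window of 13 are assumptions chosen for illustration; the slides only state that windows of 11-21 aa are used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_windows(sequence, window=13):
    """Return one feature vector per residue.

    Each vector concatenates one-hot encodings of the residues in a
    window centred on that residue. Window positions that fall beyond
    the chain termini set an extra 'padding' slot instead.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a centre")
    half = window // 2
    width = len(AMINO_ACIDS) + 1  # 20 amino acids + 1 padding slot
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for center in range(len(sequence)):
        features = []
        for pos in range(center - half, center + half + 1):
            one_hot = [0] * width
            if 0 <= pos < len(sequence):
                # Raises KeyError for non-standard residue letters.
                one_hot[index[sequence[pos]]] = 1
            else:
                one_hot[-1] = 1
            features.extend(one_hot)
        encoded.append(features)
    return encoded


vectors = encode_windows("MSEGEDDFPRKRTPWCFDDEHMC")
print(len(vectors), "vectors of length", len(vectors[0]))
```

Each vector would be paired with the known state (H, E or C) of its central residue when training, and fed to the trained network when predicting.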
Neural Networks
[Figure: the query is fed to the trained network; the output is a 3-state prediction for the residue under consideration. Picture from B. Rost, 1999.]
Secondary Structure Prediction

Using information from evolution:
• Compute a sequence profile from a multiple sequence alignment.
• Use profile instead of query as input to Neural Network.
• 6-8 percentage points increase in accuracy over Neural Network only.
• Eg:
  • PHD/PROF: alignments by MaxHom (B. Rost, 1996/2000)
  • PSI-PRED: alignments from Psi-Blast (D.T. Jones, 1999)
Accuracy: 72% ± 11%.
Accuracy measured as Q3 = (# of correctly predicted secondary structure states) / (total # of residues)
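The Q3 measure is simple enough to compute by hand or with a few lines of code. This sketch assumes both strings use the same 3-state alphabet and expresses the result as a percentage; the "predicted" string is invented for the example.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


observed = "CCHHHHHHCCCCEEEEEECCCCC"   # example from the Terminology slides
predicted = "CCCHHHHHCCCCCEEEEECCCCC"  # hypothetical prediction
print(round(q3(predicted, observed), 1))
```

Here the hypothetical prediction misses the first residue of the helix and the first residue of the strand, so 21 of 23 states agree.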
Accuracy Illustration
[Figure: Psi-Pred benchmark on a set of 187 chains (D.T. Jones, 1999). Annotation: Your query could be here !!]
In particular, accuracy can be as low as 50% for a given query
=> Use many different methods and compare answers.
Other Structural Features
There are other structural features that one can try to predict:
• coiled-coils,
• membrane helices,
• solvent accessibility,
• globularity,
• disulfide bridges,
• conformational switches,
…
POPULAR SERVERS
FOR DEALING WITH
SECONDARY STRUCTURES
• Coiled-coils
• Transmembrane helices
• Secondary structure
• Metaservers
Prediction of coiled-coils
Coiled-coils are generally solvent exposed
multi-stranded helix structures:
two-stranded
Helix periodicity and solvent exposure impose
special pattern of heptad repeat:
… abcdefg …
Helical diagram of 2 interacting helices (legend: hydrophobic residues, hydrophilic residues)
(From Wikipedia Leucine zipper article)
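To make the heptad pattern concrete, here is a toy sketch that tries all seven possible registers of the abcdefg repeat and reports how often the a and d positions are hydrophobic, which is the typical pattern in coiled-coils. This is not the COILS algorithm described next; the hydrophobic residue set and the test sequence are assumptions made for illustration.

```python
HEPTAD = "abcdefg"
HYDROPHOBIC = set("AILMFVWY")  # assumed set, for illustration only


def heptad_frame_scores(sequence):
    """For each possible register, return the fraction of a/d positions
    occupied by hydrophobic residues.

    Keys are the heptad letter assigned to the first residue.
    """
    scores = {}
    for offset, first in enumerate(HEPTAD):
        core = [aa for i, aa in enumerate(sequence)
                if HEPTAD[(i + offset) % 7] in "ad"]
        hits = sum(aa in HYDROPHOBIC for aa in core)
        scores[first] = hits / len(core) if core else 0.0
    return scores


# Synthetic sequence: leucine at a and d, lysine elsewhere.
scores = heptad_frame_scores("LKKLKKK" * 4)
best = max(scores, key=scores.get)
print(best, scores[best])
```

For the synthetic sequence the register that places the first residue at position a is the only one in which every a and d position is hydrophobic.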
The COILS server at EMBnet



• Compares a sequence to a database of known, parallel two-stranded coiled-coils, and derives a similarity score.
• By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.
• Options:
  • scoring matrices,
  • window size (score may vary),
  • weighting options.
COILS Limitations
• The program works well for parallel two-stranded structures that are solvent-exposed but runs progressively into problems with the addition of more helices, their antiparallel orientation and their decreasing length.
• The program fails entirely on buried structures.
COILS Demo
Let us submit the sequence
>1jch_A
VAAPVAFGFPALSTPGAGGLAVSISAGALSAAIADIMAALKGPFKFGLWGVALYGVLPSQ
IAKDDPNMMSKIVTSLPADDITESPVSSLPLDKATVNVNVRVVDDVKDERQNISVVSGVP
MSVPVVDAKPTERPGVFTASIPGAPVLNISVNNSTPAVQTLSPGVTNNTDKDVRPAFGTQ
GGNTRDAVIRFPKDSGHNAVYVSVSDVLSPDQVKQRQDEENRRQQEWDATHPVEAAERNY
ERARAELNQANEDVARNQERQAKAVQVYNSRKSELDAANKTLADAIAEIKQFNRFAHDPM
AGGHRMWQMAGLKAQRAQTDVNNKQAAFDAAAKEKSDADAALSSAMESRKKKEDKKRSAE
NNLNDEKNKPRKGFKDYGHDYHPAPKTENIKGLGDLKPGIPKTPKQNGGGKRKRWTGDKG
RKIYEWDSQHGELEGYRASDGQHLGSFDPKTGNQLKGPDPKRNIKKYL
to the COILS server at EMBnet:
http://www.ch.embnet.org/software/COILS_form.html
mtidk matrix, no weights, all window lengths
[Figure: COILS output.
• Frame probabilities at each residue.
• Columns: window size of 14, 21, 28 aa.
Label: high probability heptads.]
Transmembrane Region Prediction
Transmembrane regions:
• Usually contain residues with hydrophobic side chains (surface must be hydrophobic).
• Usually ~20 residues long, can be up to 30 if not perpendicular through membrane.
Methods:
• Hydropathy plots (historical, better methods now available),
• Threading (TMpred, MEMSAT),
• Hidden Markov Model (TMHMM),
• Neural Network (PHDhtm).
Hydropathy Plots (Kyte-Doolittle)


• Compute an average hydropathy value for each position in the query sequence,
• window length of 19 usually chosen for membrane-spanning region prediction.
• Peaks between scales 1-2?
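A hydropathy profile of this kind is a plain sliding-window average. The sketch below uses the commonly tabulated Kyte-Doolittle scale values; they are reproduced here from memory of the standard table and should be checked against the original publication before being relied on. The function name is made up for illustration.

```python
# Kyte-Doolittle hydropathy values as commonly tabulated
# (verify against the original paper before serious use).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy over every full window of the sequence.

    Entry k is the mean over residues k .. k+window-1 (0-based), so it
    is usually plotted at the window centre, k + window // 2.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    profile = []
    for start in range(len(values) - window + 1):
        profile.append(sum(values[start:start + window]) / window)
    return profile


# First line of the RCEM_RHOVI sequence used in the demo.
segment = "ADYQTIYTQIQARGPHITVSGEWGDNDRVGKPFYSYWLGKIGDAQIGPIYLGASGIAA"
profile = hydropathy_profile(segment)
print(len(profile), "window averages; maximum", round(max(profile), 2))
```

Stretches where the averaged value stays high are candidate membrane-spanning regions; the servers below draw the same kind of curve.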
Hydropathy Plot Servers
• Membrane Explorer (also as standalone MPEx),
• Grease (http://fasta.bioch.virginia.edu/fasta/grease.htm)
Let us submit the sequence
>sp|P06010|RCEM_RHOVI Reaction center protein M chain (Photosynthetic reaction
center M subunit) - Rhodopseudomonas viridis.
ADYQTIYTQIQARGPHITVSGEWGDNDRVGKPFYSYWLGKIGDAQIGPIYLGASGIAA
FAFGSTAILIILFNMAAEVHFDPLQFFRQFFWLGLYPPKAQYGMGIPPLHDGGWWLM
AGLFMTLSLGSWWIRVYSRARALGLGTHIAWNFAAAIFFVLCIGCIHPTLVGSWSEGV
PFGIWPHIDWLTAFSIRYGNFYYCPWHGFSIGFAYGCGLLFAAHGATILAVARFGGDR
EIEQITDRGTAVERAALFWRWTIGFNATIESVHRWGWFFSLMVMVSASVGILLTGTFV
DNWYLWCVKHG AAPDYPAYLPATPDPASLPGAPK
to
http://blanco.biomol.uci.edu/mpex/ (Membrane Explorer)
TM Pred
Method summary:
• Scans a candidate sequence for matches to a sequence scoring matrix, obtained by aligning the sequences of all transmembrane alpha-helical regions that are known from structures.
• These sequences are collected in a database called TMBase.

Remark: Authors do not suggest this method for genomic
sequences. Automatic methods recommended, eg,
TMHMM, PHDhtm.
TM Pred Server
Let us submit RCEM_RHOVI again
>sp|P06010|RCEM_RHOVI Reaction center protein M chain (Photosynthetic reaction
center M subunit) - Rhodopseudomonas viridis.
ADYQTIYTQIQARGPHITVSGEWGDNDRVGKPFYSYWLGKIGDAQIGPIYLGASGIAA
FAFGSTAILIILFNMAAEVHFDPLQFFRQFFWLGLYPPKAQYGMGIPPLHDGGWWLM
AGLFMTLSLGSWWIRVYSRARALGLGTHIAWNFAAAIFFVLCIGCIHPTLVGSWSEGV
PFGIWPHIDWLTAFSIRYGNFYYCPWHGFSIGFAYGCGLLFAAHGATILAVARFGGDR
EIEQITDRGTAVERAALFWRWTIGFNATIESVHRWGWFFSLMVMVSASVGILLTGTFV
DNWYLWCVKHG AAPDYPAYLPATPDPASLPGAPK
to the TMPred server at EMBnet:
http://www.ch.embnet.org/software/TMPRED_form.html
Annotation for RCEM_RHOVI

Uniprot entry for RCEM_RHOVI:
• Chain M of photosynthetic reaction center.
• Integral membrane protein.
Can we see the predicted helices in the structure?
Let's try at SCOP.
The Psi-Pred Server
• Secondary structure prediction (PSIPRED)
• Transmembrane topology prediction (MEMSAT)
• Fold recognition (GenTHREADER)
Let's submit
>uniprot|P00772|ELA1_PIG Elastase-1 precursor
MLRLLVVASLVLYGHSTQDFPETNARVVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTL
IRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDDVAAGYDI
ALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQLAQTLQQAYLPTVD
YAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVSRLGC
NVTRKPTVFTRVSAYISWINNVIASN
to http://bioinf.cs.ucl.ac.uk/psipred/
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
# PSIPRED HFORMAT (PSIPRED V2.5 by David Jones)
Conf: 978999999997404555676678816988988788877499999934884158982897
Pred: CHHHHHHHHHHHHHCCCCCCCCCCCCEECCEECCCCCCCCEEEEEEECCCCCEEEEEEEE
  AA: MLRLLVVASLVLYGHSTQDFPETNARVVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTL
              10        20        30        40        50        60
Conf: 138734320122478742368754345663179827995679998026888865344411
Pred: CCCCEEEEECCCCCCCCCEEEEEEEEEEEECCCCCEEEEEEEEEEECCCCCCCCCCCCCH
  AA: IRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDDVAAGYDI
              70        80        90       100       110       120
Conf: 010005863201367530113433210010268995234110254467622168863110
Pred: HHEECCCCCCEEEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCCCEEEEEEEEE
  AA: ALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQLAQTLQQAYLPTVD
             130       140       150       160       170       180
Conf: 024554202566567752773344343221110467438998993899999972376889
Pred: CHHHHHHHCCCCCCCCCEEEECCCCCCCCCEEECCCCEEEEECCEEEEEEEEEECCCCCC
  AA: YAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVSRLGC
             190       200       210       220       230       240
Conf: 88988779999687678899886049
Pred: CCCCCCEEEEEHHHHHHHHHHHHHCC
  AA: NVTRKPTVFTRVSAYISWINNVIASN
             250       260
(see later for comparison
with solved structure)
Meta-Servers
A server which
• allows you to obtain predictions from different methods in parallel, in one browser window, eg:
  • PredictProtein: http://predictprotein.org
• or makes predictions based on several methods (consensus), eg:
  • 3D-Jury: http://bioinfo.pl/meta
  • GeneSilico: http://www.genesilico.pl/meta
The PredictProtein meta-server
• Sequence motif search:
  • ProSite, ProDom, SEG.
• One-Dim structure prediction:
  • secondary structure,
  • transmembrane helices,
  • solvent accessibility,
  • globularity,
  • disulfide bridge,
  • conformational switch.
• Links to a multitude of other servers (numerous links also from 3D-Jury).
Motif Search at PP



• SEG: finds low complexity regions.
• ProSite: database of functional motifs, ie, biologically relevant short patterns.
• ProDom: a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases.
More on domains and protein family
classification next week (ADDA, Pfam etc.).
ProSite: http://au.expasy.org/prosite/
ProDom: http://protein.toulouse.inra.fr/prodom/current/html/home.php
One-Dim predictions at PP

Use information from evolution:
• Sequence database is scanned for similar sequences (Blast, Psi-Blast).
• Multiple sequence alignment profiles are generated by weighted dynamic programming (MaxHom).
The PROF (improved PHD) series:
• PROFsec (PHDsec): secondary structure,
• PROFacc (PHDacc): solvent accessibility,
• PHDhtm: transmembrane helices.
Meta-PP
PredictProtein can automatically submit a query to other servers:
Secondary structure prediction:
• Psi-Pred, SAM-T02, Jpred, …
Membrane helices prediction:
• TMHMM, …
Tertiary structure prediction:
• Homology: Swiss-Model, 3D-Jigsaw, …
• Threading: Superfamily, AGAPE, …
• Inter-residue contact prediction: CMAPpro, …
PredictProtein Demo
Let's submit again
>uniprot|P00772|ELA1_PIG Elastase-1 precursor
MLRLLVVASLVLYGHSTQDFPETNARVVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTL
IRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDDVAAGYDI
ALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQLAQTLQQAYLPTVD
YAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVSRLGC
NVTRKPTVFTRVSAYISWINNVIASN
to http://predictprotein.org/
For a list of mirror sites:
http://predictprotein.org/newwebsite/doc/mirrors.html
Let's explore the results here.
Comparison with solved structure
ELA1_PIG Elastase-1 has a solved structure: 1EST
DSSP: ??????????????????????????CBTCEECCTTTCTTEEEEEEEETTEEEEEEEEEEEETTEEEECSGGGCSCCCEE
PSIP: .HHHHHHHHHHHHH............EE..EE........EEEEEEE.....EEEEEEEE....EEEEE.........EE
PROF: ..HHHHHHHHHHH............EEEE.EE.......EEEEEEEE......EEEEEEEE...EEEEEEEEE.....EE
DSSP: EEESCSBTTSCCSCCEEEEEEEEEECTTCCTTCGGGCCCCEEEEESSCCCCBTTBCCCCCCCTTCCCCTTCCEEEEESCB
PSIP: EEEEEEEEEE.....EEEEEEEEEEE.............HHHEE......EEEEEEEE............EEEEEEE...
PROF: EEEEEEE........EEEEEEEEEEE.............EEEEEE........EEEEEE............EEEEEEEE.
DSSP: SSTTCCBCSBCEEEECCEECHHHHTSTTTTGGGSCTTEEEECCSSSSBCCTTCTTCEEEEEETTEEEEEEEEEECBTTBS
PSIP: ...........EEEEEEEEE.HHHHHHH.........EEEE.........EEE....EEEEE..EEEEEEEEEE......
PROF: ..........EEEEEEEEE..................EEEE...............EEEEEE...EEEEEEEE.......
DSSP: SBTTBCEEEEEGGGSHHHHHHHHHTC
PSIP: ......EEEEEHHHHHHHHHHHHH..
PROF: .......EEEEHHHHHHHHHHHH...
DSSP: secondary structure assignment from PDB (Kabsch-Sander, 1983)
• H = alpha helix
• B = residue in isolated beta-bridge
• E = extended strand, participates in beta ladder
• G = 3-helix (3/10 helix)
• I = 5 helix (pi helix)
• T = hydrogen bonded turn
• S = bend
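To score a 3-state prediction against a DSSP assignment, the DSSP states first have to be reduced to three. The sketch below assumes one common reduction (H, G, I to helix; E, B to strand; everything else to coil), treats `.` in the prediction lines as coil, and skips residues marked `?` in the DSSP line. Other reductions are in use and give somewhat different numbers, so the mapping is an assumption, not the one used by any particular server.

```python
# Assumed 8-to-3 state reduction; other conventions exist.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Map DSSP states to H/E/C, leaving '?' (no assignment) untouched."""
    return "".join(s if s == "?" else DSSP_TO_3.get(s, "C") for s in dssp)


def q3_against_dssp(predicted, dssp):
    """Q3 (in %) of a prediction against a DSSP string.

    '.' in the prediction counts as coil; positions where DSSP shows
    '?' are left out of the comparison.
    """
    if len(predicted) != len(dssp):
        raise ValueError("strings must have the same length")
    observed = reduce_dssp(dssp)
    pairs = [("C" if p == "." else p, o)
             for p, o in zip(predicted, observed) if o != "?"]
    if not pairs:
        raise ValueError("no residues with a DSSP assignment")
    return 100.0 * sum(p == o for p, o in pairs) / len(pairs)


# Last block of the comparison above (26 C-terminal residues).
dssp_tail = "SBTTBCEEEEEGGGSHHHHHHHHHTC"
psip_tail = "......EEEEEHHHHHHHHHHHHH.."
prof_tail = ".......EEEEHHHHHHHHHHHH..."
print("PSIP", round(q3_against_dssp(psip_tail, dssp_tail), 1))
print("PROF", round(q3_against_dssp(prof_tail, dssp_tail), 1))
```

The same function can be applied to the full-length strings once the wrapped lines are joined.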
Conclusions




• Both predictions agree quite well and are quite accurate.
• But: it may not be as good next time.
=>
• Compare predictions from different methods to check whether there is a consensus.
• Use servers that automatically combine different methods (3D-Jury, ...).
Benchmarks

LiveBench
http://bioinfo.pl/meta/livebench.pl

CASP (critical assessment of structure prediction)
http://predictioncenter.gc.ucdavis.edu/

CAFASP (critical assessment of fully automated structure prediction)
http://www.cs.bgu.ac.il/~dfisher/CAFASP5/index.html
References

Documentation:
• COILS: http://www.ch.embnet.org/software/coils/COILS_doc.html
• TMPred: http://www.ch.embnet.org/software/tmbase/TMBASE_doc.html
• MPEx: http://blanco.biomol.uci.edu/mpex/MPEXdoc.html
Articles:

B. Rost: Evolution teaches neural networks. In Scientific applications of neural nets. Ed.
J.W.Clark, T.Lindenau, M.L. Ristig, 207-223 (1999).

D.T Jones: Protein Secondary Structure Prediction Based on Position-specific Scoring
Matrices. J.Mol.Biol. 292, 195-202 (1999).

B. Rost: Prediction in 1D: Secondary Structure, Membrane Helices, and Accessibility. In
Structural Bioinformatics (reference below).
Books:

P.E. Bourne, H. Weissig: Structural Bioinformatics. Wiley-Liss, 2003.

A. Tramontano: Protein Structure Prediction. Wiley-VCH, 2006.