secstruct_and_sign_pep_PT

Protein Analysis Workshop 2012
Secondary Structure Prediction and Signal Peptides
Bioinformatics group
Institute of Biotechnology
University of Helsinki
Earlier version: Hung Ta
Current: Petri Törönen
Why Secondary Structure Predictions and
Signal Peptides?

Sequence homology is usually a good source of information.
However, sometimes no good homology is found.
We then need other sources of information to aid us:
• Domain (profile) homologies (later lectures)
• Secondary structure
• Signal peptides
• Transmembrane regions
Secondary structure and signal peptides are also useful
input for other bioinformatics tools.
Secondary Structure


Alternative when only weak sequence homology
Structure more conserved than sequence
→ Similar sec. struct. gives extra support for weak
sequence homology

Special cases of sec. struct. can suggest
function or localization
Hierarchy of Protein Structure
Primary Structure: a Linear Arrangement
of Amino Acids



An amino acid has several structural components: a central carbon atom
(Cα), an amino group (NH2), a carboxyl group (COOH), a hydrogen atom
(H), and a side chain (R). There are 20 standard amino acids.
The peptide bond is formed when the carboxyl group of one aa binds to the
amino group of the adjacent aa.
The primary structure of a protein is simply the linear arrangement, or
sequence, of the amino acid residues that compose it.
Secondary Structure: Core Elements of
Protein Architecture

Results from the folding of localized parts of a
polypeptide chain.

α-helix and β-sheet:
major internal supportive elements, 60 percent of the
polypeptide chain

Coils, turns
α-Helix

Hydrogen-bonded

3.6 residues per turn

Axial dipole moment

Side chains point outward

Average length is 10 amino acids
(3 turns).

Typically rich in Alanine,
Glutamine, Leucine, Methionine;
and poor in Proline, Glycine,
Tyrosine and Serine.
β-Sheet

Formed by hydrogen bonds
between β-strands, which are short
polypeptide segments (5-8
residues).

Adjacent β-strands run in the
same direction -> parallel sheet.

Adjacent β-strands run in
opposite directions -> anti-parallel
sheet.
Ribbon diagram
Turns, loops, coils…

A turn, composed of 3-4 residues, forms
sharp bends that redirect the polypeptide
backbone back toward the interior.

A loop is similar to a turn but can form
longer bends.

Turns and loops help large proteins fold into
compact structures.

A random coil is a class of conformations
that indicate an absence of regular
secondary structure.
Turn
Secondary Structure Prediction

Why: the first level of structural organization.

The task: assign a secondary structure state to each amino acid (aa)
of the primary sequence.
• H: α-helix
• E: β-strand
• T: turn
• C: coil
Primary:   MSEGEDDFPRKRTPWCFDDEHMC
Secondary: CCHHHHHHCCCCEEEEEECCCCC
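The per-residue encoding above can be turned into a list of structural elements. The following is a minimal sketch using the example strings from this slide; the function name `segments` is made up for illustration.

```python
from itertools import groupby

def segments(states):
    """Collapse a per-residue state string into (state, start, end) runs.

    Positions are 1-based and inclusive, as is usual for protein sequences.
    """
    runs = []
    pos = 1
    for state, group in groupby(states):
        length = len(list(group))
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs

primary = "MSEGEDDFPRKRTPWCFDDEHMC"
secondary = "CCHHHHHHCCCCEEEEEECCCCC"

# One state letter per residue, so both strings must have the same length.
assert len(primary) == len(secondary)

for state, start, end in segments(secondary):
    # Print each element with the residues it covers.
    print(state, start, end, primary[start - 1:end])
```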
Secondary Structure Prediction
Single residue statistical analysis (Chou-Fasman, 1974):

For each amino acid type, assign its ‘propensity’ to be in a helix, β-sheet, or coil.

Based on 15 proteins of known conformation, 2473 amino acids
in total.

Limited accuracy: ~55-60% on average.

E.g. Chou-Fasman (1974); no longer used.
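A propensity of this kind can be sketched as an observed-to-expected ratio: how often an amino acid is seen in a state, divided by how often any residue is seen in that state. The code below is only an illustration of that idea, not the published Chou-Fasman procedure or its parameter table. The training set is the single toy sequence from the earlier slide, far too small for real use.

```python
from collections import Counter

def propensities(pairs):
    """Compute propensities from (sequence, states) string pairs.

    Returns {(amino_acid, state): ratio}, where
    ratio = P(state | amino_acid) / P(state).
    A value above 1 means the amino acid is over-represented in that state.
    """
    aa_counts = Counter()
    state_counts = Counter()
    joint_counts = Counter()
    total = 0
    for seq, states in pairs:
        if len(seq) != len(states):
            raise ValueError("sequence and state strings must have equal length")
        for aa, state in zip(seq, states):
            aa_counts[aa] += 1
            state_counts[state] += 1
            joint_counts[(aa, state)] += 1
            total += 1
    return {
        (aa, state): (n / aa_counts[aa]) / (state_counts[state] / total)
        for (aa, state), n in joint_counts.items()
    }

# Toy training data (assumption: one short example, for illustration only).
training = [("MSEGEDDFPRKRTPWCFDDEHMC", "CCHHHHHHCCCCEEEEEECCCCC")]
table = propensities(training)

# Propensity of glycine (G) for the helix state (H) in the toy data.
print(round(table[("G", "H")], 2))
```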
Secondary Structure Prediction
Segment-based statistics:

Look for correlations (within 11-21 aa windows).

Many algorithms have been tried.

Best-performing: neural networks:

Input: a number of protein sequences with their known secondary
structure.

Output: a trained network that predicts secondary structure elements for
given query sequences.

Accuracy < 70%.
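Accuracy figures like these are per-residue: the fraction of positions where the predicted state equals the observed state (for three states this is commonly called Q3). A minimal sketch follows; the predicted string is a made-up example, not the output of any server.

```python
def per_residue_accuracy(predicted, observed):
    """Fraction of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have equal length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

observed = "CCHHHHHHCCCCEEEEEECCCCC"   # example from the earlier slide
predicted = "CCCHHHHHCCCCCEEEEECCCCC"  # hypothetical prediction

print(f"{per_residue_accuracy(predicted, observed):.1%}")
```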
Popular Servers for
Secondary Structure Prediction



Jpred (http://www.compbio.dundee.ac.uk/www-jpred/)
Psipred (http://bioinf.cs.ucl.ac.uk/psipred/)
Metaserver PredictProtein
(http://www.predictprotein.org/).
PSIPRED and JPRED
Test with uniprot|P00772|ELA1_PIG
Elastase-1 precursor
• Correct answer:
http://www.uniprot.org/uniprot/P00772

PSIPRED
(http://bioinf.cs.ucl.ac.uk/psipred/result/351083)
JPRED
(http://www.compbio.dundee.ac.uk/www-jpred/results/jp_Pt7zBV4/jp_Pt7zBV4.results.html)
• Above: the summary
• On the right: the detailed view
Special Cases of Secondary
Structure

Informative special cases of secondary
structures. These include:
• Coiled-coil regions
• Transmembrane regions
Prediction of coiled-coils
• Coiled-coil proteins are often biologically
relevant regulators (e.g. transcription factors)
• Coiled coils are generally solvent-exposed,
multi-stranded helix structures:
two-stranded
Helix periodicity and solvent exposure impose
a special pattern, the heptad repeat:
… abcdefg …
Helical diagram of
2 interacting helices:
• hydrophobic residues
• hydrophilic residues
(From Wikipedia Leucine zipper article)
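The heptad pattern can be checked with a few lines of code. In coiled coils the hydrophobic residues typically sit at positions a and d. The sketch below tries all seven registers and reports how well each fits. It is not the COILS algorithm; the hydrophobic residue set and the test sequence are assumptions chosen for illustration.

```python
HEPTAD = "abcdefg"
HYDROPHOBIC = set("AILMFVWY")  # assumption: a simple hydrophobic residue set

def heptad_score(seq, offset=0):
    """Fraction of a/d positions occupied by hydrophobic residues.

    `offset` is the heptad position index (0 = 'a') assigned to the
    first residue of `seq`.
    """
    hits = 0
    total = 0
    for i, aa in enumerate(seq):
        position = HEPTAD[(i + offset) % 7]
        if position in "ad":
            total += 1
            hits += aa in HYDROPHOBIC
    return hits / total if total else 0.0

def best_register(seq):
    """Offset (0-6) that puts the most hydrophobic residues at a and d."""
    return max(range(7), key=lambda offset: heptad_score(seq, offset))

# Hypothetical idealized sequence: L at positions a and d of every heptad.
toy = "LKELEDK" * 4
register = best_register(toy)
print(HEPTAD[register], heptad_score(toy, register))
```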
The COILS server at EMBnet



Compares a sequence to a database of known,
parallel two-stranded coiled-coils, and derives a
similarity score.
By comparing this score to the distribution of
scores in globular and coiled-coil proteins, the
program then calculates the probability that the
sequence will adopt a coiled-coil conformation.
Options:
• scoring matrices,
• window size (score may vary),
• weighting options.
COILS Limitations
The program works well for parallel two-stranded structures that are solvent-exposed but runs progressively into
problems with the addition of more helices,
their antiparallel orientation and their
decreasing length.
• The program fails entirely on buried
structures.

COILS Demo
Let us submit the sequence
>1jch_A
VAAPVAFGFPALSTPGAGGLAVSISAGALSAAIADIMAALKGPFKFGLWGVALYGVLPSQ
IAKDDPNMMSKIVTSLPADDITESPVSSLPLDKATVNVNVRVVDDVKDERQNISVVSGVP
MSVPVVDAKPTERPGVFTASIPGAPVLNISVNNSTPAVQTLSPGVTNNTDKDVRPAFGTQ
GGNTRDAVIRFPKDSGHNAVYVSVSDVLSPDQVKQRQDEENRRQQEWDATHPVEAAERNY
ERARAELNQANEDVARNQERQAKAVQVYNSRKSELDAANKTLADAIAEIKQFNRFAHDPM
AGGHRMWQMAGLKAQRAQTDVNNKQAAFDAAAKEKSDADAALSSAMESRKKKEDKKRSAE
NNLNDEKNKPRKGFKDYGHDYHPAPKTENIKGLGDLKPGIPKTPKQNGGGKRKRWTGDKG
RKIYEWDSQHGELEGYRASDGQHLGSFDPKTGNQLKGPDPKRNIKKYL
to the COILS server at EMBnet:
http://www.ch.embnet.org/software/COILS_form.html
Correct answer:
http://www.rcsb.org/pdb/explore/explore.do?structureId=1JCH
Transmembrane Region Prediction
Transmembrane proteins are important receptor
or transport proteins.
Transmembrane regions:
• Usually contain residues with hydrophobic side
chains (surface must be hydrophobic).
• Usually ~20 residues long, can be up to 30 if
not perpendicular through membrane.
Methods:
• Hydropathy plots (historical, better methods now available),
• Threading (TMpred, MEMSAT),
• Hidden Markov Model (TMHMM),
• Neural Network (PHDhtm).
Hydropathy Plots (Kyte-Doolittle)

The hydropathy index of an amino acid is a number
representing the hydrophobic or hydrophilic properties of
its side chain.

An average hydropathy value is computed for each position
in the query sequence.

A window length of 19 is usually chosen for membrane-spanning region prediction.
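The windowed average can be sketched as follows, using the Kyte-Doolittle hydropathy values and the window length of 19 mentioned above. The function name and the test sequence are made up for illustration. Unknown residue letters raise a KeyError in this sketch.

```python
# Kyte-Doolittle hydropathy index (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Average hydropathy for every full window along the sequence.

    Element i of the result is the mean over residues i .. i+window-1
    (0-based). Returns an empty list if the sequence is shorter than
    the window.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        # Slide the window by one residue: add the new one, drop the oldest.
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile

# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
toy = "K" * 10 + "L" * 20 + "K" * 10
profile = hydropathy_profile(toy)
peak = max(range(len(profile)), key=profile.__getitem__)
print("highest window starts at residue", peak + 1)
```

Plotting `profile` against the window position gives the hydropathy plot; peaks mark candidate membrane-spanning regions.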
•Skip this
Hydropathy Plot Servers
•Skip this
Let us submit the sequence
>sp|P06010|RCEM_RHOVI Reaction center protein M chain (Photosynthetic reaction center M subunit) - Rhodopseudomonas viridis.
ADYQTIYTQIQARGPHITVSGEWGDNDRVGKPFYSYWLGKIGDAQIGPIYLGASGIA
AFAFGSTAILIILFNMAAEVHFDPLQFFRQFFWLGLYPPKAQYGMGIPPLHDGGWWL
MAGLFMTLSLGSWWIRVYSRARALGLGTHIAWNFAAAIFFVLCIGCIHPTLVGSWSE
GVPFGIWPHIDWLTAFSIRYGNFYYCPWHGFSIGFAYGCGLLFAAHGATILAVARFG
GDREIEQITDRGTAVERAALFWRWTIGFNATIESVHRWGWFFSLMVMVSASVGILLT
GTFVDNWYLWCVKHGAAPDYPAYLPATPDPASLPGAPK
to
• Membrane Explorer (also as standalone MPEx),
• Grease (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=misc1)
Remove the FASTA header if the sequence is not read correctly.
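If a server does not accept the pasted record, the header can be removed and the sequence cleaned with a short script. This is a minimal sketch; the function name is made up, and the record below contains only two lines excerpted from the sequence above, not the full protein.

```python
def read_fasta_sequence(text):
    """Return (header, sequence) for the first FASTA record in `text`.

    The header is the line starting with '>'. All whitespace inside the
    sequence lines is removed. The header must be on a single line;
    a wrapped header would be read as sequence.
    """
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
            continue
        parts.append("".join(line.split()))
    return header, "".join(parts).upper()

# Excerpt for illustration only (first and last sequence lines).
record = """>sp|P06010|RCEM_RHOVI Reaction center protein M chain
ADYQTIYTQIQARGPHITVSGEWGDNDRVGKPFYSYWLGKIGDAQIGPIYLGASGIA
GTFVDNWYLWCVKHG AAPDYPAYLPATPDPASLPGAPK
"""

header, sequence = read_fasta_sequence(record)
print(header)
print(sequence)
```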
Hydropathy Plot
• The larger the number is, the more
hydrophobic the amino acid
Correct answer (http://pir.uniprot.org/uniprot/P06010)
•Skip this
TM Pred
Method summary:
Scans a candidate sequence for matches
to a sequence scoring matrix, obtained by
aligning the sequences of all
transmembrane alpha-helical regions that
are known from structures.
• These sequences are collected in a
database called TMBase.

Remark: The authors do not recommend this method for genomic
sequences. Automatic methods are recommended instead, e.g.
TMHMM, PHDhtm.
TM Pred Server
Let us submit RCEM_RHOVI again
>sp|P06010|RCEM_RHOVI Reaction center protein M chain (Photosynthetic reaction center M subunit) - Rhodopseudomonas viridis.
ADYQTIYTQIQARGPHITVSGEWGDNDRVGKPFYSYWLGKIGDAQIGPIYLGASGIA
AFAFGSTAILIILFNMAAEVHFDPLQFFRQFFWLGLYPPKAQYGMGIPPLHDGGWWL
MAGLFMTLSLGSWWIRVYSRARALGLGTHIAWNFAAAIFFVLCIGCIHPTLVGSWSE
GVPFGIWPHIDWLTAFSIRYGNFYYCPWHGFSIGFAYGCGLLFAAHGATILAVARFG
GDREIEQITDRGTAVERAALFWRWTIGFNATIESVHRWGWFFSLMVMVSASVGILLT
GTFVDNWYLWCVKHGAAPDYPAYLPATPDPASLPGAPK
to the TMPred server at EMBnet:
http://www.ch.embnet.org/software/TMPRED_form.html
Meta-Servers
A meta-server returns many kinds of information
about your sequence, including
structure predictions and motif or domain
searches. The predictions are based on
several methods.
• PredictProtein: http://predictprotein.org

The PredictProtein meta-server

For sequence analysis, structure and function prediction. When you submit
any protein sequence, PredictProtein retrieves similar sequences in the
database and predicts aspects of protein structure and function.

SEG: finds low complexity regions.

ProSite: database of functional motifs, i.e. biologically relevant short patterns.

ProDom: a comprehensive set of protein domain families automatically generated
from the SWISS-PROT and TrEMBL sequence databases.

PROFsec (PHDsec): secondary structure,

PROFacc (PHDacc): solvent accessibility,

PHDhtm: transmembrane helices.

Sequence database is scanned for similar sequences (Blast, Psi-Blast).

Multiple sequence alignment profiles are generated by weighted dynamic
programming (MaxHom).
PredictProtein Demo
Let's submit again
>uniprot|P00772|ELA1_PIG Elastase-1 precursor
MLRLLVVASLVLYGHSTQDFPETNARVVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTL
IRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDDVAAGYDI
ALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQLAQTLQQAYLPTVD
YAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVSRLGC
NVTRKPTVFTRVSAYISWINNVIASN
to http://predictprotein.org/
For a list of mirror sites:
http://predictprotein.org/newwebsite/doc/mirrors.html
Detailed results Summary view
Results
References

Documentation:
•Skip this
• COILS: http://www.ch.embnet.org/software/coils/COILS_doc.html
• TMPred: http://www.ch.embnet.org/software/tmbase/TMBASE_doc.html
• MPEx: http://blanco.biomol.uci.edu/mpex/MPEXdoc.html
Articles:

B. Rost: Evolution teaches neural networks. In Scientific applications of neural nets. Ed.
J.W.Clark, T.Lindenau, M.L. Ristig, 207-223 (1999).

D.T. Jones: Protein Secondary Structure Prediction Based on Position-specific Scoring
Matrices. J. Mol. Biol. 292, 195-202 (1999).

B. Rost: Prediction in 1D: Secondary Structure, Membrane Helices, and Accessibility. In
Structural Bioinformatics (reference below).
Books:

P.E. Bourne, H. Weissig: Structural Bioinformatics. Wiley-Liss, 2003.

A. Tramontano: Protein Structure Prediction. Wiley-VCH, 2006.
Signal Peptides





Short peptide chain that directs the transport of
a protein
Located mostly at the N- or C-terminus
Targets in eukaryotes: ER, nucleus, nucleolus,
mitochondrion, peroxisome
Bacteria use them to secrete proteins
When sequence homology is not available,
signal peptides can still indicate the likely
location of the protein => a hint to function
Prediction of signal peptides


Challenge is to distinguish a weak signal from the
background noise
Various machine learning methods used
→ Hidden Markov Models (HMM)
→ Neural Networks

Most popular tool: SignalP
→ http://www.cbs.dtu.dk/services/SignalP/
Prediction of cellular localization



Tools that predict the cellular localization
automatically
WoLF PSORT: http://wolfpsort.org/
TargetP: http://www.cbs.dtu.dk/services/TargetP/
Signal Peptide Database





http://www.signalpeptide.de/
Collection of information on known and
predicted signal peptide - protein pairs
Allows search by sequence name and keywords
Advanced search allows hits to be limited to a single
species
This is useful when looking for extra information
on a known protein
