Prediction of protein disorder
Zsuzsanna Dosztányi
MTA-ELTE Momentum Bioinformatics Group
Department of Biochemistry, Eötvös Loránd University,
Budapest, Hungary
[email protected]
Protein Structure/Function Paradigm
Dominant view: 3D structure is a prerequisite for protein function
But….
Heat stability
Protease sensitivity
Failed attempts to crystallize
Lack of NMR signals
Increased molecular volume
“Freaky” sequences …
IDPs
Intrinsically disordered proteins/regions
(IDPs/IDRs)
Do not adopt a well-defined structure in
isolation under native-like conditions
Highly flexible ensembles
Functional proteins
p53 tumor suppressor
[Figure: p53 domain architecture]
TAD: transactivation domain (disordered region)
DBD: DNA-binding domain
TD: tetramerization domain
RD: regulation domain (disordered region)
Wells et al. PNAS 2008; 105: 5762
Bioinformatics of protein disorder
Part 1 Prediction of protein disorder
Databases
Prediction of protein disorder
Part 2 Biology of disordered proteins
Prediction of functional regions within IDPs
Datasets
Ordered proteins in the PDB
Over 100,000 structures
A few thousand folds
Some structures in the PDB classify as disordered!
They adopt a well-defined structure only in complex
(in crystals, with cofactors, other proteins, …)
Disorder in the PDB
Missing electron density regions from the PDB
NMR structures with large structural variations
Less than 10% of all positions
Usually short (<10 residues), often at the termini
DisProt
www.disprot.org
Current release: 6.02
Release date: 05/24/2013
Number of proteins: 694
Number of disordered regions: 1539
Experimentally verified disordered
proteins collected from literature
(X-ray, NMR, CD, proteolysis, SAXS,
heat stability, gel filtration, …)
Additional databases
Combining experiments and predictions
Genome level annotations
MobiDB: http://mobidb.bio.unipd.it
D2P2: http://d2p2.pro
IDEAL: http://www.ideal.force.cs.is.nagoya-u.ac.jp/IDEAL
Sequence properties of disordered proteins
Amino acid compositional bias
High proportion of polar and charged amino acids
(Gln, Ser, Pro, Glu, Lys)
Low proportion of bulky, hydrophobic amino acids
(Val, Leu, Ile, Met, Phe, Trp, Tyr)
Low sequence complexity
Signature sequences identifying disordered proteins
Protein disorder is encoded in the amino acid sequence
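The compositional bias described above can be checked on any sequence with a few lines of code. The following is a minimal sketch, not a disorder predictor: it only reports what fraction of a sequence falls into the two residue groups listed on this slide. The function name `composition_bias` and the example fragment are chosen here for illustration.

```python
from collections import Counter

# Residue groups taken from the lists on the slide (one-letter codes).
DISORDER_ENRICHED = set("QSPEK")   # Gln, Ser, Pro, Glu, Lys
ORDER_ENRICHED = set("VLIMFWY")    # Val, Leu, Ile, Met, Phe, Trp, Tyr


def composition_bias(sequence):
    """Return (fraction of disorder-enriched, fraction of order-enriched) residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n = len(seq)
    disorder_fraction = sum(counts[aa] for aa in DISORDER_ENRICHED) / n
    order_fraction = sum(counts[aa] for aa in ORDER_ENRICHED) / n
    return disorder_fraction, order_fraction


# Short fragment used purely as an illustration.
fragment = "AMDDLMLSPDDIEQWFTED"
dis_frac, ord_frac = composition_bias(fragment)
print(f"disorder-enriched: {dis_frac:.2f}, order-enriched: {ord_frac:.2f}")
```

On its own such a ratio is a crude indicator; the prediction methods that follow refine the same idea with propensity scales, energy estimates, or trained classifiers.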
Amino acid compositions
He et al. Cell Res. 2009; 19: 929
Prediction methods for protein disorder
Over 50 methods
Based on amino acid propensity scales or on simplified
biophysical models
GlobPlot, FoldIndex, FoldUnfold, IUPred, UCON
Machine learning approaches
PONDR VL-XT, VL3, VSL2; DISOPRED; POODLE-S and POODLE-L; DisEMBL;
DisPSSMP; PrDOS; DisPro; OnD-CRF; POODLE-W; RONN
1. Amino acid propensity scale
GlobPlot
Compare the tendency of amino acids:
to be in coil (irregular) structure.
to be in regular secondary structure elements
Linding (2003) NAR 31, 3701
GlobPlot
From position specific predictions
Where are the ordered domains?
Longer disordered segments?
Noise vs. real data
GlobPlot: http://globplot.embl.de/
Downhill regions correspond to putative domains (GlobDom)
Up-hill regions correspond to predicted protein disorder
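The up-hill/downhill reading of the plot can be illustrated with a cumulative sum of per-residue propensities. This is a rough sketch under stated assumptions: the values in `TOY_PROPENSITY` are invented for illustration and are NOT the published GlobPlot scale, and no smoothing is applied, so the "noise vs. real data" issue is handled only by a minimum segment length (`min_len`, also an assumption).

```python
from itertools import accumulate

# HYPOTHETICAL propensities: positive = coil-prone, negative = prone to
# regular secondary structure. Not the published GlobPlot scale.
TOY_PROPENSITY = {
    "P": 1.0, "S": 0.6, "G": 0.5, "Q": 0.4, "E": 0.3, "K": 0.3, "D": 0.3,
    "N": 0.2, "T": 0.1, "R": 0.1, "A": 0.0, "H": 0.0, "C": -0.2, "M": -0.3,
    "Y": -0.5, "F": -0.6, "W": -0.6, "L": -0.6, "V": -0.7, "I": -0.8,
}


def cumulative_curve(sequence):
    """Running sum of propensities; this is the curve that goes up-hill or downhill."""
    return list(accumulate(TOY_PROPENSITY.get(aa, 0.0) for aa in sequence.upper()))


def uphill_segments(sequence, min_len=5):
    """Return (start, end) pairs, 0-based and end-exclusive, where the curve rises."""
    seq = sequence.upper()
    segments, start = [], None
    for i, aa in enumerate(seq):
        rising = TOY_PROPENSITY.get(aa, 0.0) > 0
        if rising and start is None:
            start = i
        elif not rising and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(seq) - start >= min_len:
        segments.append((start, len(seq)))
    return segments


toy_sequence = "VLIVLPSGQEKSPLLIV"
curve = cumulative_curve(toy_sequence)
print(uphill_segments(toy_sequence))
```

Short rising stretches are discarded by `min_len`; a real implementation would more likely smooth the curve before looking at its slope.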
Globular proteins
Large entropy penalty
Large number of inter-residue contacts
2. Physical principles
IUPred
If a residue cannot form enough favorable
interactions within its sequential environment,
it will not adopt a well-defined structure:
it will be disordered
Based on an energy estimation method
Parameters calculated from statistics of globular proteins
No training on disordered proteins
Dosztanyi (2005) JMB 347, 827
IUPred
The algorithm:
…PSVEPPLSQETFSDL WKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPRVA PAPAAPTPAA...
Based only on the amino acid composition of the environment of a residue (here, D),
we try to predict whether it is in a disordered region or not:
Amino acid composition of environment:
A – 10%
C – 0%
D – 12%
E – 10%
F – 2%
etc.
Estimate the interaction energy between the residue and its environment
Decide the probability of the residue being disordered based on this
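The three steps (environment composition, energy estimate, disorder probability) can be sketched as follows. This is an illustration of the flow only, not the IUPred implementation: the pairwise energy is built from invented `TOY_STICKINESS` values rather than the matrix derived from globular proteins, the window size is arbitrary, and the logistic mapping with its `midpoint` and `steepness` parameters is an assumption made here.

```python
import math
from collections import Counter

# HYPOTHETICAL per-residue values used to build a toy pairwise energy.
# The actual method uses a 20x20 matrix derived from globular proteins.
TOY_STICKINESS = {
    "I": 1.0, "V": 0.9, "L": 0.9, "F": 0.9, "W": 0.8, "M": 0.7, "Y": 0.7,
    "C": 0.6, "A": 0.4, "T": 0.3, "H": 0.3, "G": 0.2, "R": 0.2, "N": 0.1,
    "Q": 0.1, "S": 0.1, "D": 0.0, "E": 0.0, "K": 0.0, "P": 0.0,
}


def environment_composition(sequence, i, half_window=10):
    """Step 1: fractional composition around position i, excluding i itself."""
    lo, hi = max(0, i - half_window), min(len(sequence), i + half_window + 1)
    env = sequence[lo:i] + sequence[i + 1:hi]
    if not env:
        return {}
    return {aa: n / len(env) for aa, n in Counter(env).items()}


def toy_energy(sequence, i, half_window=10):
    """Step 2: toy interaction energy; more negative means more favourable."""
    comp = environment_composition(sequence, i, half_window)
    s_i = TOY_STICKINESS.get(sequence[i], 0.0)
    return -sum(s_i * TOY_STICKINESS.get(aa, 0.0) * frac for aa, frac in comp.items())


def toy_disorder_score(sequence, i, half_window=10, midpoint=-0.2, steepness=20.0):
    """Step 3: map energy to a 0..1 score (arbitrary logistic parameters)."""
    energy = toy_energy(sequence, i, half_window)
    return 1.0 / (1.0 + math.exp(-steepness * (energy - midpoint)))


fragment = "AMDDLMLSPDDIEQWFTED"
d_position = fragment.index("D")
print(environment_composition(fragment, d_position))
print(toy_disorder_score(fragment, d_position))
```

A residue with few favourable contacts in its environment gets an energy near zero and therefore a high score; a hydrophobic residue in a hydrophobic environment gets a strongly negative energy and a low score.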
3. Machine learning approaches
INPUT: sequence window
… A T V Q L S M I W Q S T R …
OUTPUT: D (disordered) or O (ordered)
DISOPRED2:
…..AMDDLMLSPDDIEQWFTED…..
SVM with linear kernel
F(inp): assign label D or O
Ward (2004) JMB 337, 635
DISOPRED2
Cutoff value!
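The window-in, label-out scheme and the role of the cutoff can be sketched with a plain linear decision function. Several assumptions are made here: the window is one-hot encoded (the real method does not necessarily encode its input this way), the weights in `TOY_RESIDUE_WEIGHT` are invented rather than learned, and the window size of 15 is chosen for illustration. The point is only how F(inp) and the cutoff together produce D or O.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def windows(sequence, size=15):
    """Yield (centre index, window string), padding both ends with '-'."""
    half = size // 2
    padded = "-" * half + sequence + "-" * half
    for i in range(len(sequence)):
        yield i, padded[i:i + size]


def one_hot(window):
    """Flatten a window into a 0/1 vector of length len(window) * 20."""
    vec = []
    for aa in window:
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec


def linear_decision(vec, weights, bias):
    """F(inp) for a linear model: dot product plus bias."""
    return sum(w * x for w, x in zip(weights, vec)) + bias


def label_residues(sequence, weights, bias, cutoff=0.0, size=15):
    """Assign D (disordered) or O (ordered) to every residue."""
    labels = []
    for _, win in windows(sequence, size):
        score = linear_decision(one_hot(win), weights, bias)
        labels.append("D" if score > cutoff else "O")
    return "".join(labels)


# HYPOTHETICAL weights, identical at every window position; a trained
# model would learn a separate weight for each position and residue.
SIZE = 15
TOY_RESIDUE_WEIGHT = {
    aa: 1.0 if aa in "DEKPQS" else -1.0 if aa in "FILMVWY" else 0.0
    for aa in AMINO_ACIDS
}
weights = [TOY_RESIDUE_WEIGHT[ref] / SIZE for _ in range(SIZE) for ref in AMINO_ACIDS]

fragment = "AMDDLMLSPDDIEQWFTED"
print(label_residues(fragment, weights, bias=0.0, cutoff=0.0))
print(label_residues(fragment, weights, bias=0.0, cutoff=0.3))
```

Raising the cutoff turns borderline D labels into O, trading false positives for false negatives, which is why the cutoff value deserves attention when reading a prediction.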
PONDR VSL2
Short and long disordered regions differ:
in amino acid composition
methods trained on one type of dataset and tested on
the other showed lower performance
PONDR VSL2: separate predictors for short and long
disorder, combined into one
Length-independent predictions
Peng (2006) BMC Bioinformatics 7, 208
4. Metaservers:
Meta-predictor:
Sequence → disorder prediction methods → ANN → Prediction
Disorder prediction methods combined:
PONDR VLXT, PONDR VL3, PONDR VSL2, IUPred, FoldIndex, TopIDP
Xue et al. Biochim Biophys Acta. 2010; 1804: 996
Disordered regions and secondary structure
Coil is an ordered, irregular structural element
Disordered proteins usually do not contain stable secondary
structural elements
(e.g. by CD)
They can contain transient secondary structure elements
(by NMR)
Pure random coil never occurs
Use secondary structure prediction methods for
disordered proteins with extreme caution
Long segments without predicted secondary structure may
indicate protein disorder (NORSnet)
Accuracy
•True positive: Disordered residues are predicted as disordered
•False positive: Ordered residues predicted as disordered
•True negative: Ordered residues predicted as ordered
•False negative: Disordered residues predicted as ordered
75-90%
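The four outcome classes above translate directly into the usual performance measures. The sketch below counts them from an observed and a predicted label string and derives sensitivity, specificity, accuracy and balanced accuracy. The slide does not say which measure the 75-90% range refers to, so reporting all four is a choice made here; the label strings are invented for illustration.

```python
def confusion_counts(observed, predicted, positive="D"):
    """Count TP, FP, TN, FN with disordered ('D') as the positive class."""
    if len(observed) != len(predicted):
        raise ValueError("label strings must have the same length")
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == positive and pred == positive:
            tp += 1          # disordered predicted as disordered
        elif obs != positive and pred == positive:
            fp += 1          # ordered predicted as disordered
        elif obs != positive and pred != positive:
            tn += 1          # ordered predicted as ordered
        else:
            fn += 1          # disordered predicted as ordered
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Derive per-residue performance measures from the four counts."""
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Invented labels for illustration only.
observed = "DDDDOOOOOO"
predicted = "DDDOOOOODD"
counts = confusion_counts(observed, predicted)
stats = summarize(*counts)
print(counts, stats)
```

Because disordered residues are a minority in structure-derived datasets, plain accuracy can look good for a predictor that labels almost everything as ordered; balanced accuracy weights the two classes equally.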
Prediction of protein disorder
Disordered residues can be predicted from
the amino acid sequence
~80% accuracy at the residue level
Methods can be specific to a certain type of
disorder
accordingly, accuracies vary depending on
datasets
Predictions are based on a binary
classification of disorder
Heterogeneity in protein disorder
Transient structures
Flexible loop
RC-like
Compact
Modularity in proteins
Many proteins contain multiple domains
Composed of ordered and disordered segments
Average length of a PDB chain is < 300
Average length of a human protein ~ 500
Average length of cancer-related proteins > 900
Structural properties of full length proteins …
Practical