Transcript Psi-blast

Structural Bioinformatics
Protein Tertiary
Structure Prediction
The different levels of Protein Structure
Primary: amino acid linear sequence.
Secondary: -helices, β-sheets and loops.
Tertiary: the 3D shape of the fully folded
polypeptide chain
The 3D structure of a protein is
stored in a coordinate file
Each atom is represented by
a coordinate in 3D (X, Y, Z)
The coordinate file can be viewed
graphically
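As a rough illustration of "each atom is a coordinate in 3D", the sketch below reads one atom record and extracts (X, Y, Z). It assumes the coordinate file uses the fixed-column PDB ATOM record layout; the sample atom and its coordinate values are made up for the example, and the helper names are hypothetical.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a PDB-style ATOM record with fixed-width columns (illustrative helper)."""
    return "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z
    )


def parse_atom_line(line):
    """Return atom name, residue, chain and (x, y, z), or None for non-atom lines."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "residue": line[17:20].strip(),   # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "xyz": (
            float(line[30:38]),           # columns 31-38: X
            float(line[38:46]),           # columns 39-46: Y
            float(line[46:54]),           # columns 47-54: Z
        ),
    }


# Made-up example atom (not taken from any real structure).
SAMPLE = format_atom_line(1, " CA ", "ALA", "A", 1, 11.104, 6.134, -6.504)
atom = parse_atom_line(SAMPLE)
print(atom["name"], atom["xyz"])
```

A viewer such as PyMOL does the same kind of parsing for every atom in the file before drawing the structure.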
RBP
Description is given in slides 35-36
Predicting 3D Structure
An outstanding, difficult problem
Based on sequence homology
– Comparative modeling (homology)
Based on structural homology
– Fold recognition (threading)
Comparative Modeling
Similar sequences suggest similar structure
Sequence and structure alignments of two Retinol Binding Proteins
How do we evaluate structure
similarity?
Structure Alignment
Structure Alignments
There are many different algorithms for structural alignment.
The outputs of a structural alignment are a
superposition of the atomic coordinates and a
minimal Root Mean Square Distance (RMSD)
between the structures.
The RMSD of two aligned structures indicates
their divergence from one another.
[Figure: each atom N (x, y, z) in Protein V is paired with atom N (x, y, z) in Protein W]
Low values of RMSD mean similar structures
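For N paired atoms v_i in Protein V and w_i in Protein W, the RMSD can be written as RMSD = sqrt( (1/N) * sum_i |v_i - w_i|^2 ). The sketch below computes exactly this value. It assumes the two structures are already superimposed and that atoms are paired in order; finding the superposition that minimizes the RMSD is a separate step that is not shown here. The coordinates are made up.

```python
import math


def rmsd(coords_v, coords_w):
    """RMSD between two equally long lists of (x, y, z) tuples, paired by position."""
    if len(coords_v) != len(coords_w) or not coords_v:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xv - xw) ** 2 + (yv - yw) ** 2 + (zv - zw) ** 2
        for (xv, yv, zv), (xw, yw, zw) in zip(coords_v, coords_w)
    )
    return math.sqrt(squared / len(coords_v))


# Made-up coordinates: W is V shifted by 1 along the Z axis.
protein_v = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
protein_w = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(protein_v, protein_w))
```

Identical structures give an RMSD of 0, and the value grows as the paired atoms move further apart.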
Comparative Modeling
Similar sequence suggests similar structure
Builds a protein structure model based on
its alignment (sequence) to one or more
related protein structures in the database
Can we use comparative
modeling for any given
sequence?
Comparative Modeling
• The accuracy of a comparative model is
usually related to the sequence identity on
which it is based
>50% sequence identity = high accuracy
30%-50% sequence identity = 90% can be modeled
<30% sequence identity = low accuracy (many errors)
However, other parameters (such as the length of the region of identity)
can influence the results
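The identity thresholds above can be sketched in a few lines. This example assumes the two sequences are already aligned (same length, "-" for gaps) and counts identity over columns where neither sequence has a gap; other definitions of percent identity exist, so treat the choice as an assumption. The sequences and function names are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def expected_accuracy(identity):
    """Map percent identity to the rough categories listed on the slide."""
    if identity > 50:
        return "high accuracy"
    if identity >= 30:
        return "90% can be modeled"
    return "low accuracy (many errors)"


identity = percent_identity("ACDEF", "ACDQF")  # made-up aligned sequences
print(identity, expected_accuracy(identity))
```

The function ignores the length of the alignment, which is one of the other parameters that can change how much the identity value should be trusted.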
Comparative Modeling
Modeling of a sequence based on known structures
Consists of four major steps:
1. Finding a known structure(s) related to the sequence
to be modeled (template), using sequence comparison
methods such as PSI-BLAST
2. Aligning sequence with the templates
3. Building a model
4. Assessing the model
What is a good model?
Fold Recognition
Protein Folds: sequential and spatial
arrangement of secondary structures
Globin
TIM
Similar folds usually mean similar function
Homeodomain: transcription factors
The same fold can have multiple functions
Rossmann: 12 different functions
TIM barrel: 31 different functions
Fold Recognition
• Fold recognition attempts to detect similarities
between protein 3D structures that have no
significant sequence similarity.
• Search for folds that are compatible with a
particular sequence.
Basic steps in Fold Recognition:
Compare the sequence against a library of all known protein folds (a finite number)
Query sequence
MTYGFRIPLNCERWGHKLSTVILKRP...
Goal: find the fold template that the sequence fits best
There are different ways to evaluate sequence-structure fit
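[Figure: the query sequence is scored against the fold library, with entries numbered 1) ... 56) ... n); an example sequence MAHFPGFGQSLLFGYPVYVFGD...; fit scores -10, -123 and 20.5; one entry labeled "Potential fold"]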
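One very simplified way to score sequence-structure fit is sketched below. It is a toy, not the scoring function of any real threading program: each fold in a hypothetical library is described only by a string of buried ("B") and exposed ("E") positions, a hydrophobic residue scores +1 in a buried position and a polar residue scores +1 in an exposed one, and every other pairing scores -1. Here a higher score means a better fit, which is an assumption of this example; the fold names, the residue grouping and the sequence are all made up, and gaps are not handled.

```python
HYDROPHOBIC = set("AVILMFWC")  # simplified grouping, assumed for this toy example

# Hypothetical fold library: one burial pattern per fold.
LIBRARY = {
    "fold_1": "BBEEBBEE",
    "fold_2": "EEBBEEBB",
}


def fit_score(sequence, environments):
    """+1 for each residue whose type suits its position in the fold, -1 otherwise."""
    score = 0
    for residue, env in zip(sequence, environments):
        suits = (residue in HYDROPHOBIC) == (env == "B")
        score += 1 if suits else -1
    return score


def best_fold(sequence, library):
    """Score the sequence against every fold and return the best name plus all scores."""
    scores = {name: fit_score(sequence, envs) for name, envs in library.items()}
    return max(scores, key=scores.get), scores


name, scores = best_fold("AVKDLIQS", LIBRARY)
print(name, scores)
```

Real methods use much richer descriptions of each position and allow insertions and deletions, but the overall loop is the same: score the query against every fold and rank the results.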
Fold Recognition
• Fold recognition methods "turn the protein folding problem on its head":
rather than predicting how a sequence will fold,
they predict how well a fold will fit a sequence
Ab Initio Modeling
• Compute molecular structure from laws of
physics and chemistry alone
Theoretically the ideal solution
Practically nearly impossible
Why?
– Exceptionally complex calculations
– Biophysics understanding incomplete
How do we know what is a good prediction?
CASP - Critical Assessment of Structure Prediction
• Competition among different groups to predict
the 3D structure of proteins that are about to be
solved experimentally.
• Current state:
– Ab initio: the weakest, but greatly improved in recent years.
– Modeling: performs very well when homologous
sequences with known structures exist.
– Fold recognition: performs well.
What can you do?
FOLDIT
Solve Puzzles for Science
A computer game to fold proteins
http://fold.it/portal/puzzles
What’s Next
Predicting function from structure
Protein structures give us insight into
protein function and mechanism of action
• Protein complexes
• Fold
• Biological processes
• Evolutionary relationships
• Location of mutants and SNPs
• Shape and electrostatics
• Active sites
• Protein-ligand complexes
• Functional sites
Classical approach for function prediction
[Figure: a new structure of unknown function (?) is matched to a similar known structure, suggesting a similar function]
Given a protein structure,
can we predict the function of a
protein when we do not have a known
homolog in the database?
A different approach for predicting
function from structure which does not rely on
homology
• To characterize the known protein structures
belonging to a specific family
• Find general structural features which are
unique to the family
• Use these features to predict new members of
the family
EXAMPLE :
Predicting new DNA-binding
proteins
p53
Many DNA-binding proteins are involved in cancer
Many different folds but all can bind DNA
Helix-Turn-Helix
Zinc-Finger
Leucine zippers
β-ribbon
While DNA-binding proteins have diverse folds,
they all share a common property:
all have positively charged surfaces
that complement the negative charge of the DNA
[Figure legend: positive charge in blue, negative charge in red]
DNA-binding proteins are characterized by positively
charged surfaces
But so are proteins that do not bind nucleic acids
Strategy for predicting new
DNA-binding proteins
1. Build a database of DNA-binding and non
DNA-binding proteins
2. Extract the positive electrostatic patch in all
proteins in the data set.
3. Find features that could be used to discriminate
the DNA-binding proteins from other proteins.
4. Use the features as a vector to train a machine
learning algorithm to identify novel DNA-binding proteins
Machine learning algorithm
for predicting protein function from structural
features
• SVM (Support Vector Machine) is trained on a set of
known proteins that have a common function such as DNA
binding (red dots), and in addition, a separate set of
proteins that are known not to bind DNA (blue dots)
• Using this training set of DNA-binding and non-DNA-binding
proteins, an SVM would learn to differentiate between the
members and non-members of the family
• Having learned the features of the class (DNA-binding
proteins), the SVM could recognize a new protein as a member
or a non-member of the class based on the combination of its
structural features.
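The train-then-classify idea can be sketched without any external library. The example below uses a perceptron, a simpler linear classifier that is swapped in here only to keep the code dependency-free; it is not an SVM and is not the method used in the slides. Each protein is reduced to two made-up numeric features, and the labels (+1 for DNA-binding, -1 for non-binding) and all values are invented for illustration.

```python
def train_perceptron(samples, labels, epochs=100):
    """Learn weights and bias so that sign(w.x + b) matches the +1/-1 labels."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: nudge the boundary toward this sample
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return weights, bias


def predict(weights, bias, x):
    """Return +1 (predicted member of the class) or -1 (predicted non-member)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


# Made-up training set: two hypothetical structural features per protein.
TRAIN_X = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.2, 0.3), (0.3, 0.1), (0.1, 0.2)]
TRAIN_Y = [1, 1, 1, -1, -1, -1]

weights, bias = train_perceptron(TRAIN_X, TRAIN_Y)
print(predict(weights, bias, (0.85, 0.85)), predict(weights, bias, (0.15, 0.15)))
```

An SVM differs in how it chooses the separating boundary, but the workflow is the same: train on labeled feature vectors, then classify a new protein from its features.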
Testing the algorithm for predicting
DNA-binding proteins
[Figure: bar chart of correct and incorrect predictions for DNA-binding and non-DNA-binding proteins]
True Positive = 44
True Negative = 236
False Positive = 10
False Negative = 14
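The four counts above can be turned into standard summary measures. The sketch below uses the usual definitions of accuracy, sensitivity, specificity and precision; the slides report only the raw counts, so which measures to compute is a choice made for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard summary measures computed from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # fraction of all predictions that are correct
        "sensitivity": tp / (tp + fn),   # fraction of DNA-binding proteins found
        "specificity": tn / (tn + fp),   # fraction of non-binders correctly rejected
        "precision": tp / (tp + fp),     # fraction of predicted binders that are real
    }


metrics = classification_metrics(tp=44, tn=236, fp=10, fn=14)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is about 92%, the sensitivity about 76% and the specificity about 96%, so the classifier misses DNA-binding proteins more often than it mislabels non-binders.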
PyMOL example
• Launch PyMOL
• Open file "1aqb" (PDB coordinate file)
• Display sequence
• Hide everything
• Show main chain / hide main chain
• Show cartoon
• Color by ss
• Color red
• Color green, resi 1:40
Help: http://pymol.org
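Most of the steps above can also be typed as PyMOL commands or saved in a script file. The sketch below only builds the text of such a script, so it runs without PyMOL installed. The file name "1aqb.pdb" is an assumption, the main-chain step is left out, "color by ss" is approximated by coloring helices and strands separately, and the residue range is written as 1-40.

```python
def build_pymol_script(pdb_file="1aqb.pdb"):
    """Return PyMOL command lines that roughly mirror the steps on the slide."""
    return [
        f"load {pdb_file}",        # open the PDB coordinate file (file name assumed)
        "set seq_view, 1",         # display the sequence
        "hide everything",
        "show cartoon",
        "color red, ss h",         # approximate 'color by ss': helices
        "color yellow, ss s",      # approximate 'color by ss': strands
        "color red",               # color the whole structure red
        "color green, resi 1-40",  # then color residues 1 to 40 green
    ]


script = "\n".join(build_pymol_script())
print(script)
```

The printed text can be saved to a file with a .pml extension and run from inside PyMOL.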