Bioinformatics Research and Resources at the University of


Introduction to Bioinformatics: Lecture XI
Computational Protein Structure Prediction
Jarek Meller
Division of Biomedical Informatics,
Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org
Outline of the lecture
- Protein structure and complexity of conformational search: from similarity-based methods to de novo structure prediction
- Multiple sequence alignment and family profiles
- Secondary structure and solvent accessibility prediction
- Matching sequences with known structures: threading and fold recognition
- Ab initio folding simulations
Polypeptide chains: backbone and side-chains
[Figure labels: N-ter, C-ter]
Distinct chemical nature of amino acid side-chains
[Figure labels: N-ter, C-ter, PHE, CYS, VAL, ARG, GLU]
Hydrogen bonds and secondary structures
[Figure labels: β-strand, α-helix]
Tertiary structure and long range contacts: annexin
Quaternary structure and protein-protein
interactions: annexin hexamer
Domains, interactions, complexes:
cyclin D and Cdk
Cyclin Box
Domains, interactions, complexes: VHL
[Figure labels: β, HIF-1α, VHL, Elongin C, α, Elongin B]
Protein folding problem

- The protein folding problem consists of predicting the three-dimensional structure of a protein from its amino acid sequence
- Hierarchical organization of protein structures helps to break the problem into secondary structure, tertiary structure and protein-protein interaction predictions
- Computational approaches to protein structure prediction: similarity-based and de novo methods
Polypeptide chains: backbone and
rotational degrees of freedom
        H     O         R2
        |     ||        |
NH3+ -- Cα -- C -- N -- Cα -- C -- O-
        |          |    |     ||
        R1         H    H     O
The equilibrium length of the peptide bond (C -- N) is about 1.33 [Ang].
The average Cα - Cα distance in a polypeptide chain is about 3.8 [Ang].
The angle of rotation around the N - Cα bond is called φ, and
the angle around the Cα - C bond is called ψ.
These two angles define the overall conformation of polypeptide chains.
Simplifying, if each of these single bonds has three discrete states (rotations),
each residue has 3 x 3 = 9 states, implying 9^N possible backbone conformations for a chain of N residues.
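To get a feel for how quickly the discrete-state estimate grows, the count can be evaluated directly. This is only a sketch of the arithmetic above; the function name and the `states_per_angle` parameter are illustrative and not part of the lecture.

```python
def backbone_conformations(n_residues: int, states_per_angle: int = 3) -> int:
    """Count discrete backbone conformations for a chain of n_residues.

    Each residue contributes two rotatable backbone angles (phi and psi),
    each restricted to `states_per_angle` discrete states, so a residue has
    states_per_angle**2 states (9 by default) and the chain has 9**N.
    """
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    states_per_residue = states_per_angle ** 2
    return states_per_residue ** n_residues
```

For example, `backbone_conformations(100)` returns 9**100 as an exact Python integer, a number with 96 digits, which is why exhaustive enumeration is not an option even for small proteins.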
Scoring alternative conformations with
empirical force fields (folding potentials)
Ideally, each misfolded structure should have
an energy higher than the native energy, i.e.:
$E_{\text{misfolded}} - E_{\text{native}} > 0$
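The energy-gap condition can be checked mechanically for a set of alternative (decoy) structures once a scoring function has assigned energies. The sketch below assumes the energies are already computed; the function names are hypothetical and only illustrate the inequality.

```python
from typing import Iterable, List


def energy_gaps(native_energy: float, decoy_energies: Iterable[float]) -> List[float]:
    """Return E_misfolded - E_native for every misfolded (decoy) structure."""
    return [e - native_energy for e in decoy_energies]


def native_is_recognized(native_energy: float, decoy_energies: Iterable[float]) -> bool:
    """True if every decoy has a strictly higher energy than the native state."""
    return all(gap > 0 for gap in energy_gaps(native_energy, decoy_energies))
```

A force field that returns `False` here for some protein ranks at least one misfolded structure at or below the native state, which is the kind of inaccuracy the lecture refers to.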
Ab initio (or de novo) folding simulations
- When dealing with a new fold, the similarity-based methods cannot be applied
- Ab initio folding simulations consist of a conformational search with an empirical scoring function (“force field”) to be maximized (or minimized)
- Computational bottleneck: exponential search space and sampling problem (global optimization!)
- Fundamental problem: inaccuracy of empirical force fields
- Importance of mixed protocols, such as Rosetta by D. Baker and colleagues (more when Monte Carlo protocols for global optimization are introduced)
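As a rough illustration of conformational search with a scoring function, here is a minimal Metropolis Monte Carlo sketch over discrete backbone states. It is not Rosetta or any published protocol: the move set, the temperature, and `toy_energy` (which simply counts mismatches against a hypothetical target conformation) are assumptions made for the example.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def metropolis_search(energy: Callable[[Sequence[int]], float],
                      n_residues: int,
                      n_states: int = 9,
                      n_steps: int = 5000,
                      temperature: float = 0.3,
                      seed: int = 0) -> Tuple[List[int], float]:
    """Minimize `energy` over chains of discrete per-residue states."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    current = [rng.randrange(n_states) for _ in range(n_residues)]
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(n_steps):
        i = rng.randrange(n_residues)        # pick one residue
        old_state = current[i]
        current[i] = rng.randrange(n_states)  # propose a new state for it
        delta = energy(current) - current_e
        # Metropolis criterion: always accept downhill moves, accept
        # uphill moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e += delta
            if current_e < best_e:
                best, best_e = list(current), current_e
        else:
            current[i] = old_state            # reject: restore the old state
    return best, best_e


def toy_energy(conformation: Sequence[int],
               target: Sequence[int] = (0, 4, 8, 4, 0, 4, 8, 4)) -> float:
    """Hypothetical energy: number of residues not in the target state."""
    return float(sum(s != t for s, t in zip(conformation, target)))
```

Calling `metropolis_search(toy_energy, 8)` returns the lowest-energy conformation visited and its energy. With a realistic force field the energy landscape is rugged, so the sampling problem is far harder than in this toy case.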
Similarity based approaches to structure prediction:
from sequence alignment to fold recognition
- High level of redundancy in biology: sequence similarity is often sufficient to use the “guilt by association” rule: if similar sequence, then similar structure and function
- Multiple alignments and family profiles can detect evolutionary relatedness at much lower sequence similarity, which is hard to detect with pairwise sequence alignments: PSI-BLAST by S. Altschul et al.
- For sufficiently close proteins one may superimpose the backbones using sequence alignment and then perform a conformational search (with the backbone fixed) to find the optimal geometry of the side-chains (according to an atomistic empirical force field): homology modeling (e.g. Modeller by A. Sali et al.)
- Many structures are already known (see PDB) and one can match sequences directly with structures to enhance structure recognition: fold recognition
- For both fold recognition and de novo simulation, prediction of intermediate attributes such as secondary structure or solvent accessibility helps to achieve better sensitivity and specificity
Protein families and domains
The notion of protein family is derived from evolutionary considerations:
members of the same family are related, perform the same function and
are assumed to have diverged from the same ancestor.
The notion of domain is derived from structural considerations:
“A domain is defined as an autonomous structural unit, or a reusable
sequence unit that may be found in multiple protein contexts” (Bateman et al.).
PFAM (7246 families as of April 2004):
http://www.sanger.ac.uk/Software/Pfam/
PRODOM:
http://prodes.toulouse.inra.fr/prodom/current/html/home.php
CDD:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi
Check: pfam00134.11, Cyclin_N
Multiple alignment and PSSM
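A position-specific scoring matrix (PSSM) can be derived from a multiple alignment by converting per-column residue frequencies into log-odds scores. The sketch below is a simplified illustration, not the PSI-BLAST procedure: it assumes a uniform background distribution, a flat pseudocount, and no sequence weighting, and the function names are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: Sequence[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one dict of log2-odds scores per alignment column."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm: Sequence[Dict[str, float]], seq: str) -> float:
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))
```

Residues that are conserved in a column receive positive scores and residues never observed there receive negative ones, so a family member scores higher against the profile than an unrelated sequence.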
Multiple alignment, clustering and families
- DP search gives the optimal solution but scales exponentially with the number of sequences K, O(n^K) for sequences of length n, and is not practical for more than 3 or 4 sequences.
- Standard heuristics start from pairwise alignments (e.g. PSI-BLAST, ClustalW)
- Hidden Markov Model approach to family profiles (profile HMM) as an alternative, with parameters fixed in advance by training separately for each family. Some initial multiple alignments are necessary for training (next lecture).
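For K = 2 the dynamic programming search is the familiar pairwise global alignment, which fills an n x n table; each additional sequence adds one more dimension to that table, which is where the O(n^K) cost comes from. A minimal sketch of the pairwise case follows, with a simple match/mismatch/linear-gap scoring scheme chosen here as an assumption.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Optimal global alignment score of a and b (Needleman-Wunsch DP)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so memory is linear in the length of `b` while time remains proportional to the product of the two lengths.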
Predicting 1D protein profiles from sequences:
secondary structures and solvent accessibility
a) Multiple alignment and family profiles improve prediction of local
structural propensities.
b) Use of advanced machine learning techniques, such as Neural
Networks or Support Vector Machines improves results as well.
B. Rost and C. Sander were the first to achieve more than 70%
accuracy in three-state (H, E, C) classification, applying a) and b).
SABLE server
http://sable.cchmc.org
POLYVIEW server
http://polyview.cchmc.org
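Three-state accuracy of this kind is commonly reported as the per-residue fraction of correctly assigned states, usually called Q3. The sketch below assumes that the predicted and observed assignments are given as equal-length strings over H, E and C; the function name is illustrative.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H, E or C) is correct."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty assignment")
    allowed = set("HEC")
    if not (set(predicted) <= allowed and set(observed) <= allowed):
        raise ValueError("states must be one of H, E, C")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)
```

A value above 0.70 on a set of proteins not used for training corresponds to the accuracy level quoted above.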
Predicting transmembrane domains
“Hydropathy” profiles and membrane domains prediction
Problem: Design a simple algorithm for finding putative transmembrane regions based on “hydropathy” (or hydrophobicity)
profiles. Consider an extension based on prototypes and k-NN.
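One possible starting point for this problem is a sliding-window average over a hydropathy scale, flagging stretches whose average exceeds a threshold. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a threshold of 1.6, which are conventional choices rather than values given in the lecture; the function names are hypothetical, and the k-NN extension is not covered.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> List[float]:
    """Average hydropathy of every window of `window` consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def putative_tm_regions(seq: str, window: int = 19,
                        threshold: float = 1.6) -> List[Tuple[int, int]]:
    """Return (start, end) residue ranges, end exclusive, covered by
    consecutive windows whose average hydropathy is >= threshold."""
    regions = []
    start = None
    profile = hydropathy_profile(seq, window)
    for i, h in enumerate(profile):
        if h >= threshold and start is None:
            start = i                                  # region opens here
        elif h < threshold and start is not None:
            regions.append((start, i - 1 + window))    # last passing window ends here
            start = None
    if start is not None:                              # region runs to the end
        regions.append((start, len(seq)))
    return regions
```

Sequences shorter than the window produce an empty profile and therefore no regions. Reported ranges extend somewhat beyond the hydrophobic core because any window that passes the threshold contributes all of its residues.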
Going beyond sequence similarity:
threading and fold recognition
When sequence similarity is not detectable, use a library of known
structures to match your query with target structures.
As in the case of de novo folding, one needs a scoring function
that measures compatibility between sequences and structures.
Why “fold recognition”?
- Divergent (common ancestor) vs. convergent (no common ancestor) evolution
- PDB: virtually all protein pairs with 30% or higher sequence identity have similar structures; however, most similar structures share only up to 10% sequence identity!
- www.columbia.edu/~rost/Papers/1997_evolution/paper.html (B. Rost)
- www.bioinfo.mbb.yale.edu/genome/foldfunc/ (H. Hegyi, M. Gerstein)
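The sequence identity thresholds quoted here refer to the percentage of identical residues in a pairwise alignment. A minimal sketch of that calculation is given below; definitions differ in the choice of denominator, and this version, as an assumption, divides by the number of columns in which both sequences have a residue.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)
```

The function expects sequences that are already aligned, for example the two gapped rows produced by a pairwise alignment program.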
Simple contact model for protein structure prediction
Each amino acid is represented by a point in 3D space and two amino acids are
said to be in contact if their distance is smaller than a cutoff distance, e.g. 7 [Ang].
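The contact definition translates directly into a distance test over all residue pairs. The sketch below assumes one representative point per residue (for example the Cα position) given as an (x, y, z) tuple in Angstroms; the `min_separation` parameter, which lets chain neighbors be skipped, is an addition for convenience and not part of the model as stated.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_pairs(coords: Sequence[Point],
                  cutoff: float = 7.0,
                  min_separation: int = 1) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, closer than `cutoff` Angstroms.

    min_separation is the smallest allowed j - i (must be >= 1); use a
    larger value to ignore residues adjacent along the chain.
    """
    if min_separation < 1:
        raise ValueError("min_separation must be at least 1")
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.append((i, j))
    return contacts
```

`math.dist` requires Python 3.8 or later. The resulting list of pairs is the contact map that a pairwise contact potential is summed over.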
Sequence-to-structure matching with contact models
- Generalized string matching problem: aligning a string of amino acids against a string of “structural sites” characterized by other residues in contact
- Finding an optimal alignment with gaps using inter-residue pairwise models, $E = \sum_{k<l} \epsilon_{kl}$, is NP-hard because of the non-local character of scores at a given structural site (identity of the interaction partners may change depending on location of gaps in the alignment)
R.H. Lathrop, Protein Eng. 7 (1994)
Hydrophobic contact model and
sequence-to-structure alignment
HPHPP
Approaches to this further instance of the global optimization problem:
a) Heuristic (e.g. frozen environment approximation)
b) “Profile” or local scoring functions (folding potentials)
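When gaps are not allowed, the pairwise contact energy of a placement is easy to evaluate, because the interaction partners at each structural site are fixed. The sketch below scores a sequence on a list of template contacts and slides it along the template without gaps. The hydrophobic (H) / polar (P) energies, with -1 for an H-H contact and 0 otherwise, are an assumed parameterization, and the function names are illustrative.

```python
from typing import Callable, Sequence, Tuple

Contact = Tuple[int, int]


def hp_pair_energy(a: str, b: str) -> float:
    """Assumed HP model: only hydrophobic-hydrophobic contacts are rewarded."""
    return -1.0 if a == "H" and b == "H" else 0.0


def contact_energy(sequence: str,
                   contacts: Sequence[Contact],
                   pair_energy: Callable[[str, str], float] = hp_pair_energy) -> float:
    """Sum of pair energies over all template contacts (k, l), k < l."""
    return sum(pair_energy(sequence[k], sequence[l]) for k, l in contacts)


def best_gapless_threading(sequence: str,
                           n_sites: int,
                           contacts: Sequence[Contact],
                           pair_energy: Callable[[str, str], float] = hp_pair_energy
                           ) -> Tuple[int, float]:
    """Try every gapless placement of an n_sites template on the sequence
    and return (offset, energy) of the lowest-energy placement."""
    if n_sites > len(sequence):
        raise ValueError("template is longer than the sequence")
    best_offset, best_energy = 0, float("inf")
    for offset in range(len(sequence) - n_sites + 1):
        segment = sequence[offset:offset + n_sites]
        energy = contact_energy(segment, contacts, pair_energy)
        if energy < best_energy:  # ties keep the smallest offset
            best_offset, best_energy = offset, energy
    return best_offset, best_energy
```

For instance, `contact_energy("HPHPP", [(0, 2), (1, 4)])` counts one H-H contact. Allowing gaps changes which residues occupy the contacting sites, which is what makes the general alignment problem hard and motivates the heuristic and profile-based approaches listed above.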
Using sequence similarity, predicted secondary structures
and contact potentials: fold recognition protocols
In practice, fold recognition methods are often mixtures of sequence
matching and threading, e.g., with compatibility between a sequence
and a structure measured by contact potentials and by comparing predicted
secondary structures to the secondary structure of a template.
D. Fischer and D. Eisenberg, Curr. Opinion in Struct. Biol. 1999, 9: 208
Some fold recognition servers

- PSI-BLAST (Altschul SF et al., Nucl. Acids Res. 25: 3389)
- LiveBench evaluation (http://BioInfo.PL/LiveBench/1/):
1. FFAS (L. Rychlewski, L. Jaroszewski, W. Li, A. Godzik (2000), Protein Science 9: 232): seq. profile against profile
2. 3D-PSSM (Kelley LA, MacCallum RM, Sternberg MJE, JMB 299: 499): 1D-3D profile combined with secondary structures and solvation potential
3. GenTHREADER (Jones DT, JMB 287: 797): seq. profile combined with pairwise interactions and solvation potential
- LOOPP: annotations of remote homologs
Protein
http://www.tc.cornell.edu/CBIO/loopp