The Inverse Protein Folding Problem
Canada-China Industrial Workshop, 2005
Hong Kong Baptist University
The Inverse Protein Folding Problem*
Arvind Gupta
Simon Fraser University
May 24, 2005
*Joint work with J. Manuch, C. Mead, L. Stacho, B. Bhattacharyya, X. Huang
Outline
• Background
• Forces in Protein Folding
• Hydrophobic-Polar Model
• Protein Databank
• Determining Attributes of the Ideal Lattice
• Future Steps
DNA
• Genetic code
• A “string” of nucleotides over A C G T
• Code for all proteins
• Self-replicating
Proteins
• A “string” over 20 amino acids
• In solvent, a protein folds into a unique 3D spatial
structure with minimal energy
Protein Structure
• Structure determines protein function.
• Proteins normally are in an aqueous environment
• Proteins are globular.
Proteins in the body
• Proteins are involved in all processes in the
body, e.g. insulin and hemoglobin.
Proteins and diseases
M. Thorpe, Protein Folding, HIV and Drug Design, Physics and
Technology Forefronts (2003).
Forward Protein Folding Problem
• Identify the protein structure for a specific
amino acid sequence.
MAGWTRLS..
• Central open problem in biology
• NP-hard under most models
Inverse Protein Folding Problem
• Given a structure (or a functionality) identify an
amino acid sequence whose fold will be that
structure (exhibit that functionality).
• Crucial problem in drug design.
• NP-hard under most models.
Forces acting on Proteins
• Hydrogen Bonding
• Van der Waals interactions
• Ion pairing
• Disulfide bonds
• Intrinsic properties (conformational preference)
• Hydrophobicity: the dominant
force in protein folding (Dill, 1990)
Hydrophilic: hydro (water) + philic (loving); hydrophobic: hydro (water) + phobic (fearing)
Hydrophobic Interactions
• Each amino acid can be classified as either
hydrophobic or hydrophilic (polar)
• In an aqueous environment, hydrophobic amino acids are in a
higher energy state and polar amino acids in a lower one.
Hydrophobic – Polar (HP) Model
• Introduced by Dill (1985) and Chan (1985)
• “0” for polar; “1” for hydrophobic
• Protein sequence embedded on lattice
• Each amino acid in exactly one cell
• Interactions across adjacent cells
• Empty lattice cells contain water
• Given a protein, maximize hydrophobic interactions
(native fold).
• I.e., given a 0-1 string, embed it onto a lattice,
maximizing adjacent 1’s.
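The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the 2-D square lattice, counts a hydrophobic interaction whenever two 1's occupy adjacent cells without being chain neighbours, and finds the best score by exhaustive search, which is only feasible for very short strings. The names `count_contacts` and `best_contacts` are hypothetical.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def count_contacts(seq, fold):
    """Count pairs of 1's in adjacent cells that are not chain neighbours."""
    pos = {cell: i for i, cell in enumerate(fold)}
    assert len(pos) == len(seq), "fold must place each residue in its own cell"
    contacts = 0
    for (x, y), i in pos.items():
        if seq[i] != "1":
            continue
        # Look right and up only, so every adjacent pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = pos.get((x + dx, y + dy))
            if j is not None and seq[j] == "1" and abs(i - j) > 1:
                contacts += 1
    return contacts

def best_contacts(seq):
    """Maximum number of contacts over all self-avoiding embeddings of seq."""
    best = 0

    def extend(fold, occupied):
        nonlocal best
        if len(fold) == len(seq):
            best = max(best, count_contacts(seq, fold))
            return
        x, y = fold[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                occupied.add(nxt)
                fold.append(nxt)
                extend(fold, occupied)
                fold.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best

# A 4-residue string folded into a unit square brings its two 1's together.
print(count_contacts("1001", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

The search grows exponentially with the string length, which is consistent with the hardness results quoted earlier.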
The 2-D Square Lattice
Legend (symbols not reproduced): hydrophobic “1”, polar “0”, peptide bond, hydrophobic interaction.
• Example.
Inverse protein folding
• Problem: For a given shape find a protein
(amino acid string) with a native fold
approximating the shape.
• Example.
Constructible structures
Theorem: For any constructible structure S, there
exists a protein p(S) with a native fold exactly
filling the structure S.
• Proof by induction:
– Base case:
p(S)=010010010010
– Inductive case:
• Proof:
– Folds are saturated: every hydrophobic “1” is involved
in two hydrophobic interactions
– saturated implies native
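One way to see why a saturated fold is native on the 2-D square lattice: a non-terminal residue has 4 neighbouring cells, 2 of which are taken by its chain neighbours, so it can take part in at most 2 hydrophobic interactions. A fold in which every 1 reaches that limit therefore attains the maximum possible number of interactions. The sketch below checks this for the base-case string; the coordinates in `FOLD` are one hand-built saturated fold chosen for illustration, and all function names are hypothetical.

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def contacts_per_residue(seq, fold):
    """For each residue, count non-bonded 1-1 contacts on the square lattice."""
    pos = {cell: i for i, cell in enumerate(fold)}
    counts = [0] * len(seq)
    for (x, y), i in pos.items():
        if seq[i] != "1":
            continue
        for dx, dy in NEIGHBOURS:
            j = pos.get((x + dx, y + dy))
            if j is not None and seq[j] == "1" and abs(i - j) > 1:
                counts[i] += 1
    return counts

def is_saturated(seq, fold):
    """True if every 1 takes part in exactly two hydrophobic interactions."""
    counts = contacts_per_residue(seq, fold)
    return all(c == 2 for c, s in zip(counts, seq) if s == "1")

def contact_upper_bound(seq):
    """Upper bound on contacts: free neighbouring cells of all 1's, halved."""
    last = len(seq) - 1
    free = sum((3 if i in (0, last) else 2) for i, s in enumerate(seq) if s == "1")
    return free // 2

SEQ = "010010010010"
# Hand-built fold: the four 1's form a 2x2 block, the 0's loop around it.
FOLD = [(0, 1), (1, 1), (1, 0), (2, 0), (2, 1), (3, 1),
        (3, 2), (2, 2), (2, 3), (1, 3), (1, 2), (0, 2)]

total = sum(contacts_per_residue(SEQ, FOLD)) // 2
print(is_saturated(SEQ, FOLD), total, contact_upper_bound(SEQ))
```

Because the number of interactions in this fold equals the upper bound, no other fold of the same string can have more.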
Stability of proteins
• A protein is stable if it has a unique “native fold”
(fold with minimal energy).
• Most natural proteins are stable.
• The protein in our example is not stable:
it has 82 native folds in total!
Stability of proteins
Conjecture: For any constructible structure S,
the protein p(S) is stable.
• Tested for >20,000 constructible structures.
• Mathematically proved for two simple infinite
classes of constructible structures L0 and L1.
L0 and L1: (figures not reproduced)
Boundary squares
• Diagonal frame: the smallest diagonal
rectangle containing all hydrophobic “1”s.
• Boundary square: a hydrophobic “1” lying on the
border of the diagonal frame.
(Figure: 5 boundary squares)
Boundary squares
• Useful for finding the last tile of a constructible
structure.
• A saturated fold has at least 4 of them.
Lemma. Let p = 0{0,1}*0 be a protein string not containing 11, 000 or
10101 as a substring. For every saturated fold of p, each boundary
square not adjacent to a terminal is the main square of a
corner-closed core.
Proof for L0 structures
• Take a saturated fold of p(S), where S is in L0.
• It has at least 4 boundary squares, at least 2 of which are not
adjacent to a terminal (the first or the last amino acid).
• By the Lemma, each of these is contained in a corner-closed core,
i.e., it is the red 1 of a substring 1001001 of the protein
string.
• In p(S) = 0(10010)^n(01001)^n0, there are only two
occurrences of the substring 1001001, and they
overlap.
• Hence, the cores match each other and form a fully-closed
core (closed on 3 sides): the last tile.
• Cut off the last tile and apply induction.
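The string-level claims in this proof are easy to check mechanically. The sketch below assumes the exponents in the L0 formula are read as n and n, which reproduces the base-case string 010010010010 for n = 1. The helper names `p_L0` and `occurrences` are hypothetical.

```python
def p_L0(n):
    """Protein string for the L0 class: 0 (10010)^n (01001)^n 0."""
    return "0" + "10010" * n + "01001" * n + "0"

def occurrences(text, pattern):
    """Start positions of all (possibly overlapping) occurrences of pattern."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]

for n in range(1, 6):
    p = p_L0(n)
    hits = occurrences(p, "1001001")
    # The Lemma requires that 11, 000 and 10101 do not occur in p.
    forbidden = [s for s in ("11", "000", "10101") if s in p]
    print(n, p, hits, forbidden)
```

For each n the two occurrences start 3 positions apart, so they overlap, matching the argument that the two corner-closed cores must coincide in the last tile.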
L1 structures are more complex
• p(S) = 0(10010)^n 010 (10010)^m (01001)^m 01 (01001)^{n-1} 0
• p(S) contains one occurrence of the substring
10101 (so the Lemma cannot be applied directly) and
three occurrences of 1001001 (so two corner-closed cores do not
imply a fully-closed core).
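The two substring counts for L1 can be checked the same way. This sketch assumes the exponents of the flattened formula are n, m, m and n-1, in that order; that reading is an assumption, although it does reproduce the counts stated on the slide when n is at least 2. The names `p_L1` and `count_overlapping` are hypothetical.

```python
def p_L1(n, m):
    """Assumed reading: 0 (10010)^n 010 (10010)^m (01001)^m 01 (01001)^(n-1) 0."""
    return ("0" + "10010" * n + "010" + "10010" * m
            + "01001" * m + "01" + "01001" * (n - 1) + "0")

def count_overlapping(text, pattern):
    """Number of occurrences of pattern in text, overlaps included."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))

for n in range(2, 5):
    for m in range(1, 4):
        p = p_L1(n, m)
        print(n, m, count_overlapping(p, "10101"), count_overlapping(p, "1001001"))
```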
Choosing a Lattice
• 2D is easier
– Fewer options for combinatorial case analysis
– More visually intuitive
– Torsion angles describe protein mainchain
• 3D is more relevant
– More biologically relevant
– More representative of actual protein
structures
– Directly applicable to known protein structures
Protein Data Bank
(PDB)
• Worldwide repository for
3-D biological macromolecular structure data
• Contains 30,857 known protein structures (May 17, 2005)
• Structures derived using different techniques
– Nuclear Magnetic Resonance spectroscopy
– X-ray crystallography
• PDB ‘known structures’ are really models of the
structure of a protein
Determining Ideal Lattice Attributes
1. Should all edges of the lattice be identical
in length?
2. How should distances between non-adjacent lattice points behave?
3. What angles should the lattice have?
4. How regular should the lattice be?
Use PDB statistics to answer these questions
Assemble a Set of Proteins
Select a subset of good-quality protein
structures from the PDB:
a) Protein structures generated using X-ray
diffraction
b) High resolution structures (<= 1.75 Å)
c) Model fits the experimental data well
Result: 3704 Protein structures in subset
Q1: Uniform Edge Length?
Overall distribution of consecutive residue distance:
Consecutive residue distances are consistently
about 3.8 Å.
Answer to Question 1: All edge lengths should be
uniform with length 3.8 Å.
Q2: Non-adjacent Vertex Distances?
Overall distribution of non-consecutive
residue distance:
• minimum distance: 3.06 Å
• only 10 distances < 3.5Å
• 1813 distances < 3.8Å
(out of 426 billion pairs).
Answer to Question 2: Non-adjacent vertices should
be at least 3.8 Å apart.
Q3: Lattice Angles?
(Figures: one amino acid; amino acid chain)
Overall distribution of Cα angles:
• Calculate Cα angles: angle produced by
three consecutive Cα atoms
• Group results by middle amino acid
residue type
Bimodal distribution:
• Sharp peak at 90°
• Shallow peak at 120°
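The angle measurement described above can be sketched as follows. This is a minimal illustration assuming the Cα coordinates are already available as (x, y, z) tuples in Å; reading them from PDB files is not shown, and the function names `ca_angle` and `ca_angles` are hypothetical.

```python
import math

def ca_angle(a, b, c):
    """Angle in degrees at b formed by three consecutive C-alpha positions."""
    u = [ai - bi for ai, bi in zip(a, b)]
    v = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def ca_angles(trace):
    """Angles at every interior residue of a C-alpha trace."""
    return [ca_angle(trace[i - 1], trace[i], trace[i + 1])
            for i in range(1, len(trace) - 1)]

# Toy trace with 3.8 A steps and a right-angle turn at the second residue.
trace = [(3.8, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 3.8, 0.0), (0.0, 7.6, 0.0)]
print(ca_angles(trace))
```

Grouping by the residue type of the middle atom would then only require keeping the residue name alongside each coordinate.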
Q3: Lattice Angles?
Some differences appear for Cα angles around certain amino acids.
Shown: Proline, Phenylalanine, Aspartic acid
Q4: Lattice Regularity?
• Determine the average coordinate root mean square
deviation (c-RMS) between the original PDB structure
and the lattice-approximated structure
(over the entire 3704 PDB protein subset)
\[ \text{c-RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} |a_i - b_i|^2} \]
a_i = coordinates of lattice vertex corresponding to b_i
b_i = coordinates of residue in protein X-ray structure
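As a worked illustration of the c-RMS definition, the sketch below takes the root of the mean squared distance between corresponding points. It assumes the lattice vertices and residue coordinates are already paired and expressed in the same frame; any fitting or superposition step is outside this sketch, and the function name `c_rms` is hypothetical.

```python
import math

def c_rms(lattice_pts, residue_pts):
    """Root mean square distance between paired lattice and residue points."""
    if not lattice_pts or len(lattice_pts) != len(residue_pts):
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p, q in zip(lattice_pts, residue_pts))
    return math.sqrt(total / len(lattice_pts))

# Toy data: every lattice vertex is offset from its residue by (3, 4, 0).
residues = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
lattice = [(x + 3.0, y + 4.0, z) for x, y, z in residues]
print(c_rms(lattice, residues))
```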
Q4: Lattice Regularity?
• Periodic Lattices: Cubic and Face-Centered-Cubic (FCC)
• Randomized Lattices: Shift each vertex in periodic
lattices by a random value from normal (0, 0.0025)
distribution, preserve edges
• De Novo Random Lattices: Generate random nodes and
edges, maintain average degree and edge length of periodic
lattices
Q4: Lattice Regularity?
• Average c-RMS values generally increase as the
randomization of the lattices increases

Average c-RMS:
Lattice model   Degree   Periodic lattice   Randomized periodic lattice   De novo random lattice
FCC             12       1.82               1.967                         4.85
Cubic           6        3.11               3.21                          3.96
Answer to Question 4: Periodic lattices achieve
better approximation of protein structure than
random lattices of the same degree
Results: Ideal Lattice Attributes
• Uniform edge lengths of 3.8 Å
• Minimum distance between any two
vertices of 3.8 Å
• Supporting mainly 90° and 120° angles
• Periodic in structure
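These attributes can be checked numerically for a given lattice. The sketch below does so for the face-centered cubic (FCC) lattice, using its standard description in which each vertex has 12 neighbours along the directions (±1, ±1, 0) and their permutations, scaled here so that every edge is 3.8 Å long. The function names are hypothetical, and only the neighbourhood of a single vertex is examined.

```python
import math
from itertools import permutations, product

EDGE = 3.8  # target edge length in angstroms

def fcc_directions():
    """The 12 integer neighbour directions of an FCC lattice vertex."""
    dirs = set()
    for sx, sy in product((1, -1), repeat=2):
        dirs.update(permutations((sx, sy, 0)))
    return sorted(dirs)

def scaled(direction, edge=EDGE):
    """Scale an integer direction so that the step has length `edge`."""
    factor = edge / math.sqrt(2)
    return tuple(factor * c for c in direction)

def angle_deg(u, v):
    """Angle in degrees between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

dirs = fcc_directions()
steps = [scaled(d) for d in dirs]
angles = sorted({round(angle_deg(u, v)) for u in dirs for v in dirs if u != v})
min_gap = min(math.dist(p, q) for p in steps for q in steps if p != q)
print(len(dirs), angles, round(min_gap, 3))
```

The set of angles includes both 90° and 120°, and no two neighbours of a vertex are closer together than one edge length.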
Candidate lattices (space-filling)
cubic
hex. prism
truncated tetrahedron
cuboctahedron
truncated octahedron
Candidate lattices (vector-based)
Face-centered cubic (FCC)
Side+FCC (S+FCC)
Extended FCC (e-FCC)
RMS comparison of lattices
Lattice                  c-RMS     d-RMS     a-RMS
Truncated Octahedron     5.3053    3.2479    13.0982
Hexagonal Prism          3.8704    2.4312    10.0313
Truncated Tetrahedron    3.6913    2.4133    19.9030
Simple Cubic             3.1123    2.1081    21.1005
Cuboctahedron            2.5581    1.7427    8.3526
FCC                      1.8212    1.4369    8.3346
S+FCC                    2.1791    1.5819    6.2022
e-FCC                    1.5385    1.1048    2.5700
Angle comparison of lattices
Lattice                  Degree    Closeness to 90    Closeness to 120
Trunc. octahedron        4         20                 10
Hexagonal prism          5         18                 24
Trunc. tetrahedron       6         42                 36
Cubic                    6         18                 36
Cuboctahedron            8         30                 34.29
FCC                      12        30                 32.73
S+FCC                    18        28.82              36.47
e-FCC                    42        31.40              38.72
Future
1. Investigate candidate lattices to determine
an ideal lattice for inverse protein folding
2. Mathematically prove that the ideal lattice
can generate stable sequences for
specified protein shapes within the HP
model
3. Attempt to assign specific amino acids to
lattice sites
4. Investigate protein sequences generated
by the model for stability and folding
properties.
5. Incorporate other protein folding forces
– Hydrogen Bonding
– Van der Waals interactions
– Intrinsic properties (conformational preference)
– Ion pairing
– Disulfide bonds
Questions?