Slides 3 - Department of Computer and Information Science and

CAP5510 – Bioinformatics
Protein Structures
Tamer Kahveci
CISE Department
University of Florida
1
What and Why?
• Proteins fold into a three-dimensional shape
• Structure can reveal functional information that we cannot find from sequence
• Misfolded proteins can cause diseases
– Sickle cell anemia, mad cow disease
• Used in drug design
[Figures: hemoglobin; normal vs. sickled blood cells (E→V); HIV protease inhibitor]
2
Goals
• Understand protein structures
• Primary, secondary, tertiary
• Learn how protein shapes are
– determined
– predicted
• Structure comparison (?)
3
A Protein Sequence
>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL
DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW
SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY
SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA
QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM
4
Amino Acid Composition
• Basic amino acid structure:
– The side chain, R, varies for each of the 20 amino acids
[Figure: H2N–CH(R)–COOH, with the amino group, the side chain R, and the carboxyl group labeled]
5
The Peptide Bond
• Dehydration synthesis
• Repeating backbone: N–Cα–C(=O)–N–Cα–C(=O)
– Convention: start at the amino terminus and proceed to the carboxy terminus
6
Peptidyl polymers
• A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids.
• We call the units of a protein amino acid residues.
[Figure labels: carbonyl carbon, amide nitrogen]
7
Side chain properties
• Carbon does not readily form hydrogen bonds with water – hydrophobic
• O and N are generally more likely than C to hydrogen-bond with water – hydrophilic
• We group the amino acids into three general
groups:
– Hydrophobic
– Charged (positive/basic & negative/acidic)
– Polar
8
The Hydrophobic Amino Acids
9
The Charged Amino Acids
10
The Polar Amino Acids
11
More Polar Amino Acids
And then there’s…
12
Planarity of the Peptide Bond
Psi (ψ) – the angle of rotation about the Cα–C bond.
Phi (φ) – the angle of rotation about the N–Cα bond.
The planar bond angles and bond lengths are fixed.
13
Primary & Secondary Structure
• Primary structure = the linear sequence of amino acids comprising a protein:
AGVGTVPMTAYGNDIQYYGQVT…
• Secondary structure
– Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the α-helix and the β-sheet
– The location and direction of these periodic, repeating structures is known as the secondary structure of the protein
14
The Alpha Helix
φ ≈ ψ ≈ −60°
15
Properties of the Alpha Helix
• φ ≈ ψ ≈ −60°
• Hydrogen bonds between C=O of residue n and NH of residue n+4
• 3.6 residues/turn
• 1.5 Å/residue rise
• 100°/residue turn
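The three numbers above are tied together: 3.6 residues per turn implies 360°/3.6 = 100° per residue, and a rise of 1.5 Å per residue implies a pitch of 5.4 Å per turn. A minimal sketch of that arithmetic, assuming an ideal helix (the function name `helix_geometry` is illustrative, not from the slides):

```python
RESIDUES_PER_TURN = 3.6   # from the slide
RISE_PER_RESIDUE = 1.5    # angstroms per residue, from the slide

# Derived quantities for an ideal alpha helix.
rotation_per_residue = 360.0 / RESIDUES_PER_TURN    # degrees per residue
pitch = RESIDUES_PER_TURN * RISE_PER_RESIDUE        # angstroms per full turn


def helix_geometry(n_residues):
    """Return (length in angstroms, number of turns) for an ideal helix."""
    length = n_residues * RISE_PER_RESIDUE
    turns = n_residues / RESIDUES_PER_TURN
    return length, turns


length, turns = helix_geometry(18)
print(f"18 residues: {length:.1f} A long, {turns:.1f} turns")
```

Real helices deviate from these ideal values, so the result is only a rough estimate.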
16
Properties of α-helices
• 4 – 40+ residues in length
• Often amphipathic or “dual-natured”
– Half hydrophobic and half hydrophilic
• If we examine many α-helices, we find trends…
– Helix formers: Ala, Glu, Leu, Met
– Helix breakers: Pro, Gly, Tyr, Ser
17
The beta strand (& sheet)
φ ≈ −135°
ψ ≈ +135°
18
Properties of beta sheets
• Formed of stretches of 5-10 residues in extended conformation
• Parallel/antiparallel, contiguous/non-contiguous
19
Anti-Parallel Beta Sheets
20
Parallel Beta Sheets
21
Mixed Beta Sheets
22
Turns and Loops
• Secondary structure elements are connected by regions of turns and loops
• Turns – short regions of non-α, non-β conformation
• Loops – larger stretches with no secondary structure.
– Sequences vary much more than in secondary structure regions
23
Ramachandran Plot
24
Levels of Protein Structure
• Secondary structure elements combine to form tertiary structure
• Quaternary structure occurs in multienzyme complexes
25
Protein Structure Example
[Figure: ID 12as, 2 chains; labels: beta sheet, helix, loop]
26
Views of a Protein
Wireframe
Ball and stick
27
Views of a protein
Spacefill
Cartoon
CPK colors
Carbon = green, black, or grey
Nitrogen = blue
Oxygen = red
Sulfur = yellow
Hydrogen = white
28
Common Protein Motifs
29
Mostly Helical Folding Motifs
• Four-helical bundle:
• Globin domain:
30
α/β Motifs
• α/β barrel:
31
Open Twisted
Beta Sheets
32
Beta Barrels
33
Determining the Structure of a Protein
Experimental Methods
• X-ray
• NMR
As of August 2013, the structures of more than 85,000 proteins have been determined
34
X-Ray Crystallography
Discovery of X-rays (Wilhelm Conrad Röntgen, 1895)
Crystals diffract X-rays in regular patterns (Max von Laue, 1912)
The first X-ray diffraction pattern from a protein crystal (Dorothy Hodgkin, 1934)
35
X-Ray Crystallography
• Grow millions of protein crystals
– Takes months
• Expose to radiation beam
• Analyze the image with a computer
– Average over many copies of images
• PDB
• Not all proteins can be crystallized!
36
NMR
• Nuclear Magnetic Resonance
• Nuclei of atoms vibrate when exposed to an oscillating magnetic field
• The vibrations are detected by external sensors
• Inter-atomic distances are computed from the measurements.
• Requires complex analysis. NMR can be used for short sequences (<200 residues)
• More than one model can be derived from NMR.
37
Determining the Structure of a
Protein
Computational Methods
38
The Protein Folding Problem
• Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the secondary/tertiary/quaternary structure of the resulting protein be?”
• Input: AAVIKYGCAL…
Output: φ1ψ1, φ2ψ2…
39
Structure vs. Sequence
• Observation: A protein with the same sequence (under the same circumstances) yields the same shape.
• A protein folds into a shape that minimizes the energy needed to stay in that shape.
• Protein folds in ~10-15 seconds.
40
Secondary Structure
Prediction
41
Chou-Fasman methods
• Uses statistically obtained Chou-Fasman parameters.
• Each amino acid has
– P(a): alpha
– P(b): beta
– P(t): turn
– f(i), f(i+1), f(i+2), f(i+3): additional turn parameters.
42
Chou-Fasman Parameters
Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142   83    66       0.06   0.076   0.035   0.058
Arginine       R      98    93    95       0.07   0.106   0.099   0.085
Aspartic Acid  D      101   54    146      0.147  0.11    0.179   0.081
Asparagine     N      67    89    156      0.161  0.083   0.191   0.091
Cysteine       C      70    119   119      0.149  0.05    0.117   0.128
Glutamic Acid  E      151   37    74       0.056  0.06    0.077   0.064
Glutamine      Q      111   110   98       0.074  0.098   0.037   0.098
Glycine        G      57    75    156      0.102  0.085   0.19    0.152
Histidine      H      100   87    95       0.14   0.047   0.093   0.054
Isoleucine     I      108   160   47       0.043  0.034   0.013   0.056
Leucine        L      121   130   59       0.061  0.025   0.036   0.07
Lysine         K      114   74    101      0.055  0.115   0.072   0.095
Methionine     M      145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  F      113   138   60       0.059  0.041   0.065   0.065
Proline        P      57    55    152      0.102  0.301   0.034   0.068
Serine         S      77    75    143      0.12   0.139   0.125   0.106
Threonine      T      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     W      108   137   96       0.077  0.013   0.064   0.167
Tyrosine       Y      69    147   114      0.082  0.065   0.114   0.125
Valine         V      106   170   50       0.062  0.048   0.028   0.053
43
C.-F. Alpha Helix Prediction (1)
Residue  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)     142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)     83   37   83   119  119  130  119  105  110  75   119  147  119  147  170
• Find P(a) for all letters
• Find 6 contiguous letters, at least 4 of which have P(a) > 100
• Declare these regions as alpha helix
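The nucleation step can be sketched as a sliding window over the sequence. This is a minimal illustration of the rule as stated on the slide, not the full published method; the function name `helix_nuclei` is hypothetical, and the dictionary holds only the P(a) values needed for the slide's example sequence.

```python
# P(a) values for the residues in the slide's example, copied from the
# Chou-Fasman parameter table.
P_ALPHA = {"A": 142, "E": 151, "T": 83, "L": 121, "C": 70,
           "M": 145, "Q": 111, "S": 77, "Y": 69, "V": 106}


def helix_nuclei(seq, window=6, min_formers=4, threshold=100):
    """Return start indices of windows that qualify as helix nuclei.

    A window qualifies when at least `min_formers` of its `window`
    residues have P(a) strictly greater than `threshold`.
    """
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if P_ALPHA[aa] > threshold)
        if formers >= min_formers:
            starts.append(i)
    return starts


print(helix_nuclei("AEATTLCMQSTYCYV"))
```

For the example sequence, only the first window (AEATTL) contains four residues with P(a) > 100.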
44
C.-F. Alpha Helix Prediction (2)
Residue  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)     142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)     83   37   83   119  119  130  119  105  110  75   119  147  119  147  170
• Extend in both directions until 4 consecutive letters with P(a) < 100 are found
45
C.-F. Alpha Helix Prediction (3)
Residue  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)     142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)     83   37   83   119  119  130  119  105  110  75   119  147  119  147  170
• Find sum of P(a) (Sa) and sum of P(b) (Sb) in the extended region
– If the region is long enough (>= 5 letters) and Sa > Sb, then declare the extended region as alpha helix
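The extension and decision steps can be sketched as follows. This is one reading of the slide's rules, under two assumptions that the slides do not spell out: extension stops just before a run of four consecutive residues with P(a) < 100, and a shorter run at the end of the sequence does not stop it. The helper names are hypothetical, and the nucleus (0, 6) is the window AEATTL, which satisfies the nucleation rule.

```python
# (P(a), P(b)) for the residues in the slide's example, from the parameter table.
PARAMS = {"A": (142, 83), "E": (151, 37), "T": (83, 119), "L": (121, 130),
          "C": (70, 119), "M": (145, 105), "Q": (111, 110), "S": (77, 75),
          "Y": (69, 147), "V": (106, 170)}


def is_breaker_run(window):
    """True for exactly four residues that all have P(a) < 100."""
    return len(window) == 4 and all(PARAMS[aa][0] < 100 for aa in window)


def extend_helix(seq, start, end):
    """Grow the half-open region [start, end) in both directions."""
    while end < len(seq) and not is_breaker_run(seq[end:end + 4]):
        end += 1
    while start > 0 and not is_breaker_run(seq[max(0, start - 4):start]):
        start -= 1
    return start, end


def is_helix(seq, start, end, min_len=5):
    """Apply the final test: long enough and Sa > Sb."""
    sa = sum(PARAMS[aa][0] for aa in seq[start:end])
    sb = sum(PARAMS[aa][1] for aa in seq[start:end])
    return (end - start) >= min_len and sa > sb


seq = "AEATTLCMQSTYCYV"
start, end = extend_helix(seq, 0, 6)
print(seq[start:end], is_helix(seq, start, end))
```

Under these assumptions the region grows rightward until it meets the run STYC, whose four residues all have P(a) < 100.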
46
C.-F. Beta Sheet Prediction
• Same as alpha helix prediction, with P(a) replaced by P(b)
• Resolving overlapping alpha helix & beta sheet
– Compute sum of P(a) (Sa) and sum of P(b) (Sb) in the overlap.
– If Sa > Sb => alpha helix
– If Sb > Sa => beta sheet
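The overlap rule can be sketched as a comparison of the two sums over the shared residues. The function name `resolve_overlap`, the half-open (start, end) region convention, and the example regions are assumptions made for illustration; the slides do not say what happens on a tie, so the sketch returns None in that case.

```python
# (P(a), P(b)) for the residues in the slide's example, from the parameter table.
PARAMS = {"A": (142, 83), "E": (151, 37), "T": (83, 119), "L": (121, 130),
          "C": (70, 119), "M": (145, 105), "Q": (111, 110), "S": (77, 75),
          "Y": (69, 147), "V": (106, 170)}


def resolve_overlap(seq, helix, sheet):
    """Decide the label of the overlap between a helix and a sheet region.

    `helix` and `sheet` are half-open (start, end) index pairs.
    Returns "alpha helix", "beta sheet", or None (no overlap, or a tie).
    """
    lo = max(helix[0], sheet[0])
    hi = min(helix[1], sheet[1])
    if lo >= hi:
        return None                      # the regions do not overlap
    sa = sum(PARAMS[aa][0] for aa in seq[lo:hi])
    sb = sum(PARAMS[aa][1] for aa in seq[lo:hi])
    if sa > sb:
        return "alpha helix"
    if sb > sa:
        return "beta sheet"
    return None                          # tie: not covered by the slide


# Hypothetical regions on the example sequence; the overlap is TTLC.
print(resolve_overlap("AEATTLCMQSTYCYV", (0, 9), (3, 7)))
```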
47
C.-F. Turn Prediction
Residue  A    E    A    T    T    L    C    M    Q    S    T    Y    C    Y    V
P(a)     142  151  142  83   83   121  70   145  111  77   83   69   70   69   106
P(b)     83   37   83   119  119  130  119  105  110  75   119  147  119  147  170
P(t)     66   74   66   96   96   59   119  60   98   143  96   114  119  114  50
[Figure: a four-residue window at positions i, i+1, i+2, i+3, each using the matching f() column]
• A residue at position i is predicted as a turn if all of the following hold:
– f(i)*f(i+1)*f(i+2)*f(i+3) > 0.000075
– Avg(P(t)) > 100 over positions i+k, for k=0, 1, 2, 3
– Sum(P(t)) > Sum(P(a)) and Sum(P(t)) > Sum(P(b)) over positions i+k (k=0, 1, 2, 3)
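The three turn conditions can be checked window by window. The sketch below is a minimal reading of the slide, with `predict_turns` as a hypothetical name; the parameter tuples are copied from the Chou-Fasman table for the residues of the example sequence only, and the averaged quantity in the second condition is assumed to be P(t).

```python
# residue: (P(a), P(b), P(t), f(i), f(i+1), f(i+2), f(i+3)), from the table.
PARAMS = {
    "A": (142, 83, 66, 0.06, 0.076, 0.035, 0.058),
    "E": (151, 37, 74, 0.056, 0.06, 0.077, 0.064),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.07),
    "C": (70, 119, 119, 0.149, 0.05, 0.117, 0.128),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "S": (77, 75, 143, 0.12, 0.139, 0.125, 0.106),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def predict_turns(seq, f_cutoff=0.000075):
    """Return the start indices i whose window i..i+3 passes all three tests."""
    turns = []
    for i in range(len(seq) - 3):
        window = [PARAMS[aa] for aa in seq[i:i + 4]]
        # Product of f(i+k), using the column that matches each position k.
        f_product = 1.0
        for k, p in enumerate(window):
            f_product *= p[3 + k]
        sum_a = sum(p[0] for p in window)
        sum_b = sum(p[1] for p in window)
        sum_t = sum(p[2] for p in window)
        if (f_product > f_cutoff and sum_t / 4 > 100
                and sum_t > sum_a and sum_t > sum_b):
            turns.append(i)
    return turns


print(predict_turns("AEATTLCMQSTYCYV"))
```

Each condition uses a strict inequality, so a window in which Sum(P(t)) equals Sum(P(b)) is not reported as a turn.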
48
Other Methods for SSE Prediction
• Similarity searching
– Predator
• Markov chain
• Neural networks
– PHD
• ~65% to 80% accuracy
49
Tertiary Structure Prediction
50
Forces driving protein folding
• It is believed that hydrophobic collapse is a key driving force for protein folding
– Hydrophobic core
– Polar surface interacting with solvent
• Minimum volume (no cavities)
• Disulfide bond formation stabilizes the fold
• Hydrogen bonds
• Polar and electrostatic interactions
51
Fold Optimization
• Simple lattice models (HP-models or Hydrophobic-Polar models)
– Two types of residues: hydrophobic and polar
– 2-D or 3-D lattice
– The only force is hydrophobic collapse
– Score = number of HH contacts
52
Scoring Lattice Models
• H/P model scoring: count noncovalent hydrophobic interactions.
• Sometimes:
– Penalize for buried polar or surface hydrophobic residues
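The basic H/P score can be sketched for a 2-D square lattice: count pairs of H residues that sit on neighbouring lattice points but are not consecutive in the chain. The function name `hh_contacts` and the coordinate-list input format are assumptions for illustration, and the optional penalties mentioned above are not included.

```python
def hh_contacts(hp, coords):
    """Count noncovalent H-H contacts of a fold on a 2-D square lattice.

    `hp` is a string of 'H' and 'P'; `coords` is a list of (x, y) lattice
    points, one per residue, assumed to be self-avoiding.
    """
    if len(hp) != len(coords):
        raise ValueError("one coordinate is needed per residue")
    index = {pos: i for i, pos in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if hp[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is seen once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and hp[j] == "H" and abs(i - j) > 1:
                score += 1
    return score


# A four-residue chain folded around a unit square: the two end residues touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hh_contacts("HPPH", square))
```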
53
Can we use lattice models?
• For smaller polypeptides, exhaustive search can be used
– Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
• For larger chains, other optimization and search methods must be used
– Greedy, branch and bound
– Evolutionary computing, simulated annealing
54
The “hydrophobic zipper” effect
Ken Dill ~ 1997
55
Representing a lattice model
• Absolute directions
– UURRDLDRRU
• Relative directions
– LFRFRRLLFFL
– Advantage: the immediate reversals (UD or RL) that absolute directions allow cannot occur
– Only three directions: LRF
• What about bumps? LFRRR
– Bad score
– Use a better representation
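A relative-direction string can be decoded into lattice coordinates by tracking a heading and turning it left or right before each step. The sketch below assumes a 2-D square lattice, a start at the origin, and an initial heading of "up"; none of these conventions is fixed by the slides, and the function names are hypothetical.

```python
def fold_relative(moves):
    """Decode an L/F/R string into a list of 2-D lattice coordinates."""
    x, y = 0, 0
    dx, dy = 0, 1                      # assumed initial heading: up
    coords = [(x, y)]
    for move in moves:
        if move == "L":
            dx, dy = -dy, dx           # rotate heading 90 degrees left
        elif move == "R":
            dx, dy = dy, -dx           # rotate heading 90 degrees right
        elif move != "F":
            raise ValueError(f"unknown move: {move!r}")
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def has_bump(coords):
    """True when two residues occupy the same lattice point."""
    return len(set(coords)) < len(coords)


for moves in ("LFRFRRLLFFL", "LFRRR"):
    print(moves, "bump" if has_bump(fold_relative(moves)) else "ok")
```

The string LFRRR walks around a unit square and lands on an occupied point, which is the bump the slide refers to.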
56
Preference-order representation
• Each position has two “preferences”
– If it can’t have either of the two, it will take the “least favorite” path if possible
• Example: {LR},{FL},{RL},{FR},{RL},{RL},{FL},{RF}
• Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
57
More realistic models
• Higher resolution lattices (45° lattice, etc.)
• Off-lattice models
– Local moves
– Optimization/search methods and φ/ψ representations
• Greedy search
• Branch and bound
• EC, Monte Carlo, simulated annealing, etc.
58
How to Evaluate the Result?
• Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).
• Theoretical force field:
– $G = G_{\text{van der Waals}} + G_{\text{h-bonds}} + G_{\text{solvent}} + G_{\text{coulomb}}$
• Empirical force fields
– Start with a database
– Look at neighboring residues – similar to known protein folds?
59
Comparative Modeling
1. Identify similar protein sequences from a database of known proteins (BLAST)
2. Find conserved regions by aligning these proteins (CLUSTAL-W)
3. Predict the backbone: alpha helices and beta sheets from conserved regions
4. Predict loops
5. Predict side chain positions
6. Evaluate
60
Threading: Fold recognition
• Given:
– Sequence: IVACIVSTEYDVMKAAR…
– A database of molecular coordinates
• Map the sequence onto each fold
• Evaluate
– Objective 1: improve scoring function
– Objective 2: folding
61
Folding: still a hard problem
• Levinthal’s paradox – Consider a 100-residue protein. If each residue can take only 3 positions, there are $3^{100} \approx 5 \times 10^{47}$ possible conformations.
– If it takes $10^{-13}$ s to convert from one structure to another, exhaustive search would take $1.6 \times 10^{27}$ years.
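The figures in Levinthal's argument follow from a short calculation. The sketch below assumes a year of 365.25 days; the variable names are illustrative.

```python
STATES_PER_RESIDUE = 3
N_RESIDUES = 100
SECONDS_PER_STEP = 1e-13                 # time to move between two structures
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # assumed length of a year

conformations = STATES_PER_RESIDUE ** N_RESIDUES     # exact integer
search_seconds = conformations * SECONDS_PER_STEP
search_years = search_seconds / SECONDS_PER_YEAR

print(f"conformations: {conformations:.2e}")
print(f"exhaustive search: {search_years:.2e} years")
```

The result is many orders of magnitude longer than the age of the universe, which is why folding cannot proceed by exhaustive search.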
62
Protein Classification
• Class: Similar secondary structure properties
– All alpha, all beta, alpha/beta, alpha+beta
• Fold: major secondary structure similarity.
– Globin-like (6 helices, folded leaf, partly opened)
• Superfamily: distant homologs. 25-30% sequence identity.
• Family: close homologs. Evolved from the same ancestor. High identity.
63