Protein Structure


Computational Molecular Biology
Protein Structure: Introduction and Prediction
Protein Folding
 One of the most important problems in
molecular biology
 Given the one-dimensional amino-acid
sequence that specifies the protein, what is the
protein’s fold in three dimensions?
My T. Thai
[email protected]
2
Overview
 Understand protein structures
 Primary, secondary, tertiary
 Why study protein folding:
 Structure can reveal functional information which
we cannot find from the sequence
 Misfolding proteins can cause diseases: mad cow
disease
 Use in drug designs
Overview of Protein Structure
 Proteins make up about 50% of the mass of the
average human
 Play a vital role in keeping our bodies
functioning properly
 Biopolymers made up of amino acids
 The order of the amino acids in a protein and
the properties of their side chains determine the
three dimensional structure and function of the
protein
Amino Acid
 Building blocks of
proteins
 Consist of:
 An amino group (-NH2)
 A carboxyl group (-COOH)
 A hydrogen (-H)
 A side chain group (-R)
attached to the central α-carbon
 There are 20 amino acids
 Primary protein structure
is a sequence of a chain
of amino acids
[Diagram: general amino acid structure – the central α-carbon bonded to the amino group (H2N-), a hydrogen, the side chain R, and the carboxyl group (-COOH)]
Side chains (Amino Acids)
 20 amino acids have side chains that vary in
structure, size, hydrogen bonding ability, and
charge.
 R gives the amino acid its identity
 R can be as simple as a hydrogen atom (glycine) or
more complex, such as an aromatic ring (tryptophan)
Chemical Structure of Amino Acids
How Amino Acids Become Proteins
Peptide bonds
Polypeptide
 A chain of more than fifty amino acids is called a polypeptide.
 A protein is usually composed of 50 to 400+ amino acids.
 We call the units of a protein amino acid residues.
[Diagram: the peptide bond between two residues, labeling the amide nitrogen and the carbonyl carbon]
Side chain properties
 Carbon does not make hydrogen bonds with water
easily – hydrophobic.
 These ‘water fearing’ side chains tend to sequester themselves
in the interior of the protein
 O and N are generally more likely than C to h-bond to
water – hydrophilic
 Tend to turn outward toward the exterior of the protein
Primary Structure
Primary structure: Linear String of Amino Acids
Side-chain
Backbone
... ALA PHE LEU ILE LEU ARG ...
Each amino acid within a protein is referred to as a residue
Each different protein has a unique sequence of amino acid
residues; this is its primary structure
Secondary Structure
 Refers to the spatial arrangement of contiguous
amino acid residues
 Regularly repeating local structures stabilized
by hydrogen bonds
 A hydrogen bond forms when a hydrogen atom attached
to a relatively electronegative atom is attracted to
another electronegative atom
 Examples of secondary structure are the α-helix
and the β-pleated sheet
Alpha-Helix
 Amino acids adopt the form of a
right handed spiral
 The polypeptide backbone forms
the inner part of the spiral
 The side chains project outward
 Every backbone N-H group
donates a hydrogen bond to a
backbone C=O group four residues
earlier
Beta-Pleated-Sheet
 Consists of long polypeptide
chains called beta-strands,
aligned adjacent to each other in
parallel or anti-parallel
orientation
 Hydrogen bonding between the
strands keeps them together,
forming the sheet
 Hydrogen bonding occurs
between amino and carboxyl
groups of different strands
Parallel Beta Sheets
Anti-Parallel Beta Sheets
Mixed Beta Sheets
Tertiary Structure
 The full three-dimensional structure, describing the
overall shape of a protein
 Also known as its fold
Quaternary Structure
 Many proteins are made up of multiple polypeptide chains,
each called a subunit
 The spatial arrangement of these subunits is referred to
as the quaternary structure
 Sometimes distinct proteins must combine together in
order to form the correct 3-dimensional structure for a
particular protein to function properly.
 Example: the protein hemoglobin, which carries
oxygen in blood. Hemoglobin is made of four similar
proteins that combine to form its quaternary structure.
Other Units of Structure
 Motifs (super-secondary structure):
 Frequently occurring combinations of secondary
structure units
 A pattern of alpha-helices and beta-strands
 Domains: A protein chain often consists of
different regions, or domains
 Domains within a protein often perform different
functions
 Can have completely different structures and folds
 Typically 100 to 400 residues long
What Determines Structure
 What causes a protein to fold in a particular
way?
 At a fundamental level, chemical interactions
between all the amino acids in the sequence
contribute to a protein’s final conformation
 There are four fundamental chemical forces:
 Hydrogen bonds
 The hydrophobic effect
 Van der Waals forces
 Electrostatic forces
Hydrogen Bonds
 Occur when a pair of electronegative atoms, such as
oxygen and nitrogen, share a hydrogen between them
 The pattern of hydrogen bonding is essential in
stabilizing the basic secondary structures
Van der Waals Forces
 Interactions between immediately adjacent
atoms
 Result from the attraction between an atom’s
nucleus and its neighbor’s electrons
Electrostatic Forces
 Oppositely charged side chains can form salt bridges,
which pull chains together
Experimental Determination
 Centralized database (to deposit protein
structures) called the Protein Data Bank (PDB),
accessible at
http://www.rcsb.org/pdb/index.html
 Two main techniques are used to
determine/verify the structure of a given
protein:
 X-ray crystallography
 Nuclear Magnetic Resonance (NMR)
 Both are slow, labor-intensive, and expensive
(sometimes taking longer than a year!)
X-ray Crystallography
 A technique that can reveal the precise three
dimensional positions of most of the atoms in a
protein molecule
 The protein is first isolated to yield a high
concentration solution of the protein
 This solution is then used to grow crystals
 The resulting crystal is then exposed to an X-ray beam
Disadvantages
 Not all proteins can be crystallized
 The crystalline structure of a protein may be
different from its structure in solution
 Multiple maps may be needed to get a
consensus
NMR
 The spinning of certain atomic nuclei generates
a magnetic moment
 NMR measures the energy levels of such
magnetic nuclei (radio frequency)
 These levels are sensitive to the environment of
the atom:
 What they are bonded to, which atoms they are
close to spatially, what distances are between
different atoms…
 Thus, by careful measurement, the structure of
the protein can be constructed
Disadvantages
 Constrained by the size of the protein – an upper
bound of about 200 residues
 Protein structure is very sensitive to pH.
Computational Methods
 Given the long and painful experimental
methods, we need computational approaches to
predict a protein’s structure from its sequence.
Functional Region Prediction
Protein Secondary Structure
Tertiary Structure Prediction
More Details on X-ray Crystallography
Overview
Crystal
 A crystal can be defined as an arrangement of
building blocks which is periodic in three
dimensions
Crystallize a Protein
 Have to find the right combination of all the
different influences to get the protein to
crystallize
 This can take a couple hundred or even a
thousand experiments
 Most popular way to conduct these experiments
 Hanging-drop method
Hanging drop method
 The reservoir contains a precipitant
concentration twice as high as the
protein solution
 The protein solution is made up of
50% of stock protein solution and
50% of reservoir solution
 Over time, water will diffuse from the
protein drop into the reservoir
 Both the protein concentration and
precipitant concentration will increase
 Crystals will appear after days,
weeks, months
Properties of protein crystal
 Very soft
 Mechanically fragile
 Large solvent content (30-70%)
A Schematic Diffraction Experiment
Why do we need Crystals
 A single molecule could never be oriented and
handled properly for a diffraction experiment
 In a crystal, we have about 10^15 molecules in
the same orientation, so we get a
tremendous amplification of the diffraction
 Crystals produce much simpler diffraction
patterns than single molecules
Why do we need X-rays
 X-rays are electromagnetic waves with a
wavelength close to the distance of atoms in the
protein molecules
 To get information about where the atoms are,
we need to resolve them – thus we need
radiation whose wavelength is comparable to
interatomic distances
A Diffraction Pattern
Resolution
 The primary measure of crystal order/quality of
the model
 Ranges of resolution:
 Low resolution (>3-5 Å): difficult to see the side
chains, only the overall structural fold
 Medium resolution (2.5-3 Å)
 High resolution (2.0 Å or better)
Some Crystallographic Terms
 h,k,l: Miller indices (like a name of the
reflection)
 I(h,k,l): intensity
 2θ: angle between the X-ray incident beam and
the reflected beam
Diffraction by a Molecule in a Crystal
 The electric vector of the X-ray wave forces the
electrons in our sample to oscillate with the
same wavelength as the incoming wave
Description of Waves
Structure Factor Equation
 fj: proportional to the number of electrons
atom j has
 One of the fundamental equations in X-ray
Crystallography
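The equation itself did not survive the transcript. In its standard crystallographic form (a reconstruction from the surrounding definitions, not the slide's own rendering), the structure factor is:

```latex
F(hkl) = \sum_{j} f_j \, e^{2\pi i\,(h x_j + k y_j + l z_j)}
```

where fj is the atomic scattering factor of atom j (proportional to its number of electrons) and (xj, yj, zj) are its fractional coordinates in the unit cell; the phase α(hkl) discussed on the next slide is the phase of the complex number F(hkl).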
The Phase
 From the measurement, we can only obtain the
intensity I(hkl) of any given reflection (hkl)
 The phase α(hkl) cannot be measured
How to Determine the Phase
 Small changes are
introduced into the
crystal of the protein of
interest:
 Eg: soaking the crystal
in a solution containing
a heavy atom
compound
 A second diffraction data set needs to be collected
 The two data sets are compared to determine the phases (this also
localizes the heavy atoms)
Other Phase Determination Methods
Electron Density Map
 Once we know the complete diffraction pattern
(amplitudes and phases), need to calculate an
image of the structure
 The equation (an image on the original slide) returns the
electron density (so we get a map of where the electrons
are, i.e., their concentration)
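The equation referred to above was an image on the slide; the standard Fourier synthesis it corresponds to (a reconstruction, not the slide's rendering) is:

```latex
\rho(x,y,z) = \frac{1}{V} \sum_{hkl} F(hkl) \, e^{-2\pi i\,(hx + ky + lz)}
```

where V is the unit cell volume and each F(hkl) carries both the measured amplitude and the determined phase.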
Interpretation of Electron Density
 Now, the electron density has to be interpreted in terms
of atom identities and positions.
 (1): the packing of the whole molecules in the
crystal is shown
 (2): a chain of seven amino acids is shown with the
resulting structure superimposed
 (3): the electron density of a tryptophan side chain is
shown
Refinement and the R-Factor
Nuclear Magnetic Resonance
 Concentrated protein solution (highly purified)
 Magnetic field
 Effect of radio frequencies on the resonance of
different atoms is measured.
NMR
 Behavior of any atom is influenced by
neighboring atoms
 more closely spaced residues are more perturbed
than distant residues
 can calculate distances based on perturbation
NMR spectrum of a protein
Computational Molecular Biology
Protein Structure: Secondary Prediction
Primary Structure: Symbolic Definition
 A = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} –
the set of symbols denoting all 20 amino acids
 A* – the set of all finite sequences formed out of elements of
A, called protein sequences
 Elements of A* are denoted by x, y, z, …, i.e., we write x ∈
A*, y ∈ A*, z ∈ A*, etc.
 PROTEIN PRIMARY STRUCTURE: any x ∈ A* is also
called a protein sequence or protein sub-unit
Protein Secondary Structure (PSS)
 Secondary structure: the arrangement of the
peptide backbone in space. It is produced by
hydrogen bonding between amino acids
 PROTEIN SECONDARY STRUCTURE
consists of: protein sequence and its hydrogen
bonding patterns called SS categories
Protein Secondary Structure
 Databases for protein sequences are expanding
rapidly
 The number of determined protein structures
(PSS – protein secondary structures) is still
limited compared to the number of known
protein sequences
PSSP (Protein Secondary Structure
Prediction) research is trying to bridge
this gap.
Protein Secondary Structure
The most commonly observed conformations
in secondary structure are:
Alpha Helix
Beta Sheets/Strands
Loops/Turns
Turns and Loops
 Secondary structure elements are connected by
regions of turns and loops
 Turns – short regions of non-α, non-β
conformation
 Loops – larger stretches with no secondary
structure.
Three secondary structure states
 Prediction methods are normally assessed for 3
states:
 H (helix)
 E (strands)
 L (others (loop or turn))
Secondary Structure
8 different categories:
 H: α-helix
 G: 3₁₀-helix
 I: π-helix (extremely rare)
 E: β-strand
 B: β-bridge
 T: turn
 S: bend
 L: the rest
Three SS states: Reduction methods
 Method 1, used by the DSSP program:
 H (helix) = {G (3₁₀-helix), H (α-helix)}
 E (strands) = {E (β-strand), B (β-bridge)}
 L = all the rest
• Shortly: E,B => E; G,H => H; rest => L
 Method 2, used by the STRIDE program:
 H as in Method 1
 E = {E (β-strand), B (isolated β-bridge)}
 L = all the rest
Three SS states: Reduction methods
Method 3, used by DEFINE program:
 H (helix) as in Method 1
 E (strands) = {E (β-strand)}
 L = all the rest
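The 8-to-3 state reduction of Method 1 can be sketched as a simple lookup; the function name `reduce_to_three_states` is ours, and the mapping follows the slide (G,H → H; E,B → E; everything else → L):

```python
# Method 1 (DSSP-style) 8-to-3 state reduction, as described on the slide.
DSSP_TO_3 = {"G": "H", "H": "H", "E": "E", "B": "E"}

def reduce_to_three_states(ss8):
    """Map an 8-state secondary structure string to the 3-state alphabet H/E/L."""
    return "".join(DSSP_TO_3.get(s, "L") for s in ss8)

print(reduce_to_three_states("HHGGEETTSBIL"))  # -> "HHHHEELLLELL"
```

Methods 2 and 3 differ only in which 8-state symbols map to E, so the same dictionary-based sketch covers them with a different `DSSP_TO_3` table.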
Example of typical PSS Data
Example:
 Sequence
 KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEY
NDRQGFVP
 Observed SS
 HHHHHLLLLEEEHHHLLLEEEEEELLLHHHHHHHHLLLEEEE
EELLLHHH
PSS: Symbolic Definition
Given A =
{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} – the
set of symbols denoting amino acids – and
a protein sequence x ∈ A*
Let S = {H, E, L} be the set of symbols
of the 3 states: H (helix), E (strands) and L (loop),
and S* be the set of all finite sequences of
elements of S.
 We denote elements of S* by e, e ∈ S*
PSS: Symbolic Definition
 Any one-to-one function
f : A* → S*, i.e., f ⊆ A* × S*,
is called a protein secondary structure (PSS)
identification function
An element (x, e) ∈ f is called a protein
secondary structure (of the protein sequence x)
The element e ∈ S* (of (x, e) ∈ f) is called its
secondary structure.
PSSP
 If a protein sequence shows clear similarity to
a protein of known three dimensional structure
 then the most accurate method of predicting the
secondary structure is to align the sequences by
standard dynamic programming algorithms
Why?
 homology modelling is much more accurate than
secondary structure prediction for high levels of
sequence identity.
PSSP
 Secondary structure prediction methods are of
most use when sequence similarity to a protein
of known structure is undetectable.
 It is important that there is no detectable
sequence similarity between sequences used to
train and test secondary structure prediction
methods.
Classification and Classifiers
 Given a database table DB with a special
attribute C, called a class attribute (or decision
attribute). The values c1, c2, ..., cn of the class
attribute are called class labels.
 Example:

A1  A2  A3  A4  C
1   1   m   g   c1
0   1   v   g   c2
1   0   m   b   c1
Classification and Classifiers
 The attribute C partitions the records in the DB:
 it divides the records into disjoint subsets defined by the
attribute C’s values – it CLASSIFIES the records.
 It means we use the attribute C and its values to divide
the set R of records of DB into n disjoint classes:
C1 = {r ∈ DB : C = c1} ...... Cn = {r ∈ DB : C = cn}
 Example (from our table)
C1 = { (1,1,m,g), (1,0,m,b)} = {r1,r3}
C2 = { (0,1,v,g)} ={r2}
Classification and Classifiers
 An algorithm is called a classification algorithm if it uses the
data and its classification to build a set of patterns.
 Those patterns are structured in such a way that we can use
them to classify unknown sets of objects – unknown records.
 For that reason (because of the goal) the classification
algorithm is often simply called a classifier.
 The name classifier implies more than just a classification
algorithm. A classifier is the final product of a data set and a
classification algorithm.
Classification and Classifiers
 Building a classifier consists of two phases:
training and testing.
 In both phases we use data (a training data set and a test data
set disjoint from it) for which the class labels are known for ALL
of the records.
 We use the training data set to create patterns
 We evaluate the created patterns using the test data,
whose classification is known.
 The measure of a trained classifier’s accuracy is called
predictive accuracy.
 The classifier is built, i.e., we terminate the process, when it has
been trained and tested and its predictive accuracy is at an
acceptable level.
Classifiers Predictive Accuracy
 PREDICTIVE ACCURACY of a classifier is
the percentage of correctly classified data in the
test data set.
Predictive accuracy depends heavily
on a choice of the test and training
data.
 There are many methods of choosing test
and training sets and hence evaluating the
predictive accuracy. This is a separate field of
research.
Accuracy Evaluation
 Use training data to adjust parameters of method
until it gives the best agreement between its
predictions and the known classes
 Use the testing data to evaluate how well the
method works (without adjusting parameters!)
 How do we report the performance?
 Average accuracy = fraction of all test examples
that were classified correctly
Accuracy Evaluation
 Multiple cross-validation tests have to be
performed to exclude a potential dependency of
the evaluated accuracy on the particular test set
chosen
 Jack-Knife:
 Use 129 chains for setting up the tool (training set)
 1 for estimating the performance (testing)
 This has to be repeated 130 times until each protein
has been used once for testing
 The average over all 130 tests gives an estimate of
the prediction accuracy
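The jackknife protocol above can be sketched as a leave-one-out loop. The toy numeric data and the 1-nearest-neighbour "predictor" below are our illustrations (the real setting uses 130 protein chains and a PSSP tool), and `jackknife_accuracy` is a name we invented:

```python
# Leave-one-out (jackknife) estimation: train on all items but one,
# test on the held-out item, repeat for every item, average the results.

def jackknife_accuracy(xs, ys):
    correct = 0
    for i in range(len(xs)):
        # Train on everything except item i ...
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j != i]
        # ... then test on item i alone (toy 1-NN on absolute distance).
        pred = min(train, key=lambda t: abs(t[0] - xs[i]))[1]
        correct += (pred == ys[i])
    return correct / len(xs)

xs = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]
ys = ["H", "H", "H", "E", "E", "E"]
print(jackknife_accuracy(xs, ys))  # -> 1.0
```

With 130 chains, the loop body would train the predictor on 129 chains and score it on the remaining one, exactly as the slide describes.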
PSSP Datasets
 Historic RS126 dataset. Contains 126 sub-units with known
secondary structure, selected by Rost and Sander. It is no
longer used today
 CB513 dataset. Contains 513 sub-units with known
secondary structure, selected by Cuff and Barton in 1999.
Used quite frequently in PSSP research
 HS1771 dataset. Created by Hobohm and Scharf. In
March 2002 it contained 1771 sub-units
 Lots of authors have their own, “secret” datasets
Measures for PSSP accuracy
 http://cubic.bioc.columbia.edu/eva/doc/measur
e_sec.html (for more information)
 Q3: three-state prediction accuracy (percent
successfully classified)
 Qi %obs: How many of the observed residues
were correctly predicted?
 Qi %prd: How many of the predicted residues
were correctly predicted?
Measures for PSSP Accuracy
 Aij = number of residues predicted to be in structure
type j and observed to be in type i
 Number of residues predicted to be in structure i:
ai = Σj Aji (j = 1..3)
 Number of residues observed to be in structure i:
bi = Σj Aij (j = 1..3)
Measures for SSP Accuracy
 The percentage of residues correctly predicted to be in
class i relative to those observed to be in class i:
Qi%obs = (Aii / bi) × 100
 The percentage of residues correctly predicted to be in
class i from all residues predicted to be in i:
Qi%prd = (Aii / ai) × 100
 Overall 3-state accuracy:
Q3 = (Σi Aii / Σi bi) × 100 (i = 1..3)
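These measures are direct arithmetic on the 3×3 table A; a small sketch (the function name `pssp_measures` and the example counts are ours, with rows = observed state, columns = predicted state, indices 0,1,2 standing for H, E, L):

```python
# Qi%obs, Qi%prd and Q3 computed from a confusion table A,
# where A[i][j] = residues observed in state i and predicted as state j.

def pssp_measures(A):
    n = len(A)
    b = [sum(A[i]) for i in range(n)]                       # observed in state i
    a = [sum(A[j][i] for j in range(n)) for i in range(n)]  # predicted as state i
    q_obs = [100.0 * A[i][i] / b[i] for i in range(n)]      # Qi%obs
    q_prd = [100.0 * A[i][i] / a[i] for i in range(n)]      # Qi%prd
    q3 = 100.0 * sum(A[i][i] for i in range(n)) / sum(b)    # overall Q3
    return q_obs, q_prd, q3

A = [[50, 5, 5],    # observed H
     [4, 30, 6],    # observed E
     [6, 5, 39]]    # observed L
q_obs, q_prd, q3 = pssp_measures(A)
print(round(q3, 1))  # -> 79.3
```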
PSSP Algorithms
There are three generations in PSSP
algorithms
• First Generation: based on statistical information of
single amino acids (1960s and 1970s)
• Second Generation: based on windows (segments)
of amino acids. Typically a window contains 11-21
amino acids (dominated the field until the early 1990s)
• Third Generation: based on the use of windows on
evolutionary information
PSSP: First Generation
 First generation PSSP systems are based on
statistical information on a single amino acid
 The most relevant algorithms:
Chou-Fasman, 1974
GOR, 1978
 Both algorithms claimed 74-78% predictive
accuracy, but when tested on better-constructed
datasets they were shown to have a predictive
accuracy of ~50% (Nishikawa, 1983)
Chou-Fasman method
 Uses table of conformational parameters
determined primarily from measurements of the
known structure (from experimental methods)
 Table consists of one “likelihood” for each
structure for each amino acid
 Based on frequencies of residues in α-helices,
β-sheets and turns
 Notation: P(H): propensity to form alpha helices
 f(i): probability of being in position 1 (of a turn)
Chou-Fasman Pij-values

Name            P(H)  P(E)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83     66     0.06   0.076   0.035   0.058
Arginine         98    93     95     0.07   0.106   0.099   0.085
Aspartic Acid   101    54    146     0.147  0.11    0.179   0.081
Asparagine       67    89    156     0.161  0.083   0.191   0.091
Cysteine         70   119    119     0.149  0.05    0.117   0.128
Glutamic Acid   151    37     74     0.056  0.06    0.077   0.064
Glutamine       111   110     98     0.074  0.098   0.037   0.098
Glycine          57    75    156     0.102  0.085   0.19    0.152
Histidine       100    87     95     0.14   0.047   0.093   0.054
Isoleucine      108   160     47     0.043  0.034   0.013   0.056
Leucine         121   130     59     0.061  0.025   0.036   0.07
Lysine          114    74    101     0.055  0.115   0.072   0.095
Methionine      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine   113   138     60     0.059  0.041   0.065   0.065
Proline          57    55    152     0.102  0.301   0.034   0.068
Serine           77    75    143     0.12   0.139   0.125   0.106
Threonine        83   119     96     0.086  0.108   0.065   0.079
Tryptophan      108   137     96     0.077  0.013   0.064   0.167
Tyrosine         69   147    114     0.082  0.065   0.114   0.125
Valine          106   170     50     0.062  0.048   0.028   0.053
Chou-Fasman
 A prediction is made for each type of structure
for each amino acid
 Can result in ambiguity if a region has high
propensities for both helix and sheet (higher value
usually chosen)
Chou-Fasman
How it works:
1. Assign all of the residues the appropriate set of parameters
2. Identify α-helix and β-sheet regions. Extend the regions in
both directions.
3. If structures overlap, compare average values for P(H) and
P(E) and assign secondary structure based on the best scores.
4. Turns are calculated using 2 different probability values.
Assign Pij values
1. Assign all of the residues the appropriate set of parameters

Residue:   T    S    P    T    A    E    L    M    R    S    T    G
P(H):      69   77   57   69  142  151  121  145   98   77   69   57
P(E):     147   75   55  147   83   37  130  105   93   75  147   75
P(turn):  114  143  152  114   66   74   59   60   95  143  114  156
Scan peptide for α-helix regions
2. Identify regions where 4 out of 6 have
P(H) > 100: an “alpha-helix nucleus”

Residue:  T   S   P   T   A    E    L    M    R   S   T   G
P(H):     69  77  57  69  142  151  121  145  98  77  69  57
Extend α-helix nucleus
3. Extend the helix in both directions until a set of
four consecutive residues with P(H) < 100 is found.

Residue:  T   S   P   T   A    E    L    M    R   S   T   G
P(H):     69  77  57  69  142  151  121  145  98  77  69  57

Find the sum of P(H) and the sum of P(E) in the extended region.
If the region is long enough (>= 5 residues) and sum P(H) > sum
P(E), then declare the extended region an alpha helix.
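The nucleus-finding step can be sketched directly; `find_helix_nuclei` is our name, the rule (at least 4 of 6 residues with P(H) > 100) and the P(H) values for the example sequence T S P T A E L M R S T G are from the slides:

```python
# Chou-Fasman helix nucleation: slide a 6-residue window along the
# propensity list and flag windows where >= 4 residues exceed P(H) = 100.

def find_helix_nuclei(ph, win=6, need=4, cut=100):
    """Return start indices of windows with >= `need` of `win` residues above `cut`."""
    return [i for i in range(len(ph) - win + 1)
            if sum(p > cut for p in ph[i:i + win]) >= need]

# P(H) for the example peptide T S P T A E L M R S T G
ph = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]
print(find_helix_nuclei(ph))  # -> [2, 3, 4]
```

The nuclei starting at indices 2-4 all cover the central A E L M stretch, which is what the extension step then grows and tests against the sum rule.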
Scan peptide for β-sheet regions
4. Identify regions where 3 out of 5 have a
P(E) > 100: a “β-sheet nucleus”
5. Extend the β-sheet until 4 consecutive residues
with an average P(E) < 100

Residue:  T    S   P   T    A    E   L    M    R   S   T    G
P(H):     69   77  57  69   142  151 121  145  98  77  69   57
P(E):     147  75  55  147  83   37  130  105  93  75  147  75

6. If the region average > 100 and the average
P(E) > average P(H), then declare “β-sheet”
Overlapping
 Resolving an overlap between alpha helix and beta sheet:
Compute the sum of P(H) and the sum of P(E) in the
overlap.
If sum P(H) > sum P(E) => alpha helix
If sum P(E) > sum P(H) => beta sheet
Turn Prediction
An amino acid at position i is predicted as a turn if all of the
following hold:
f(i) × f(i+1) × f(i+2) × f(i+3) > 0.000075
Avg(P(turn)) over positions i..i+3 > 100
Sum(P(turn)) > Sum(P(H)) and Sum(P(turn)) > Sum(P(E))
over positions i..i+3
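The three turn conditions translate into a short predicate; `is_turn` is our name, and the illustrative four-residue stretch N P G S uses values from the Chou-Fasman table above:

```python
# Chou-Fasman turn test on residues k..k+3: positional-frequency product,
# average turn propensity, and turn propensity beating helix and strand sums.

def is_turn(k, f1, f2, f3, f4, pt, ph, pe):
    quad = range(k, k + 4)
    prod = f1[k] * f2[k + 1] * f3[k + 2] * f4[k + 3]  # f(i)*f(i+1)*f(i+2)*f(i+3)
    avg_pt = sum(pt[j] for j in quad) / 4.0
    s_pt = sum(pt[j] for j in quad)
    s_ph = sum(ph[j] for j in quad)
    s_pe = sum(pe[j] for j in quad)
    return prod > 0.000075 and avg_pt > 100 and s_pt > s_ph and s_pt > s_pe

# Residues N, P, G, S (turn-favouring stretch); values from the table above.
f1 = [0.161, 0.102, 0.102, 0.120]   # f(i) per residue
f2 = [0.083, 0.301, 0.085, 0.139]   # f(i+1)
f3 = [0.191, 0.034, 0.190, 0.125]   # f(i+2)
f4 = [0.091, 0.068, 0.152, 0.106]   # f(i+3)
pt = [156, 152, 156, 143]
ph = [67, 57, 57, 77]
pe = [89, 55, 75, 75]
print(is_turn(0, f1, f2, f3, f4, pt, ph, pe))  # -> True
```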
PSSP: Second Generation
Based on the information contained in a
window of amino acids (11-21 aa)
Most systems use algorithms based on:
 Statistical information
 Physico-chemical properties
 Sequence patterns
 Graph theory
 Multivariate statistics
 Expert rules
 Nearest-neighbour algorithms
PSSP: First & Second Generation
Main problems:
Prediction accuracy <70%
 SS assignments differ even between crystals of
the same protein
 SS formation is partially determined by long-range
interactions, i.e., by contacts between
residues that are not visible to any method
based on windows of 11-21 adjacent residues
PSSP: First & Second Generation
Main problems:
Prediction accuracy for β-strands is 28-48%,
only slightly better than random
 β-sheet formation is determined by more
nonlocal contacts than α-helix
formation
Predicted helices and strands are usually too
short
 Overlooked by most developers
Example of Second Generation
 Example for typical secondary structure prediction of the 2nd
generation.
 The protein sequence (SEQ ) given was the SH3 structure.
 The observed secondary structure (OBS ) was assigned by DSSP (H
= helix; E = strand; blank = non-regular structure; the dashes
indicate the continuation).
 The typical prediction of too short segments (TYP ) poses the
following problems in practice.
 (i) Are the residues predicted to be strand in segments 1, 5, and 6
errors, or should the helices be elongated?
 (ii) Should the 2nd and 3rd strand be joined, or should one of them
be ignored, or does the prediction indicate two strands, here? Note:
the three-state per-residue accuracy is 60% for the prediction given.
PSSP: Third Generation
 PHD: First algorithm in this generation (1994)
 Evolutionary information improves the prediction
accuracy to 72%
 Use of evolutionary information:
1. Scan a database with known sequences with alignment methods
for finding similar sequences
2. Filter the previous list with a threshold to identify the most
significant sequences
3. Build amino acid exchange profiles based on the probable
homologs (most significant sequences)
4. The profiles are used in the prediction, i.e. in building the
classifier
PSSP: Third Generation
 Many of the second generation algorithms
have been updated to the third generation
PSSP: Third Generation
 Due to the improvement of protein information
in databases i.e. better evolutionary
information, today’s predictive accuracy is
~80%
 It is believed that maximum reachable accuracy
is 88%. Why such conjecture?
Why 88%
 SS assignments may vary for two versions of
the same structure
 Proteins are dynamic objects, with some regions
more mobile than others
 Assignments differ by 5-15% between different
X-ray (or NMR) versions of the same protein
 Assignments differ by about 12% between structural
homologues
 B. Rost, C. Sander, and R. Schneider,
Redefining the goals of protein secondary
structure predictions, J. Mol. Bio.
PSSP Data Preparation
 Public Protein Data Sets used in PSSP research contain
protein secondary structure sequences. In order to use
classification algorithms we must transform secondary
structure sequences into classification data tables.
 Records in the classification data tables are called, in
PSSP literature, (learning) instances.
 The mechanism used in this transformation process is
called window.
 A window algorithm has a secondary structure as input
and returns a classification table: set of instances for the
classification algorithm.
Window
 Consider a secondary structure (x, e).
where (x,e)= (x1x2 …xn, e1e2…en)
 Window of the length w chooses a subsequence of
length w of x1x2 …xn, and an element ei from
e1e2…en, corresponding to a special position in the
window, usually the middle
 The window moves along the sequences
x = x1x2 …xn and e = e1e2…en
simultaneously, starting at the beginning and moving to the
right one letter at a time at each step of the process.
Window: Sequence to Structure
 Such a window is called a sequence-to-structure
window. We will call it, for short, a window.
 The process terminates when the window or its
middle position reaches the end of the
sequence x.
 The pair (subsequence, element of e), often
written in the form
 subsequence → H, E or L
is called an instance, or a rule.
Example: Window
 Consider a secondary structure (x, e) and a
window of length 5 with the special position in
the middle (bold letters)
 The first position of the window is:
x = A R N S T V V S T A A ….
e = H H H H L L L E E E
 The window returns the instance:
 ARNST → H
Example: Window
 The second position of the window is:
x = A R N S T V V S T A A ….
e = H H H H L L L E E E
 The window returns the instance:
R N S T V → H
 The next instances are:
N S T V V → L
S T V V S → L
T V V S T → L
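The window mechanism can be sketched in a few lines; `window_instances` is our name, and the 3-state string for the example (the transcript's spacing of e is ambiguous, so "HHHHLLLLEEE" is an assumption chosen to reproduce the slide's instances):

```python
# Sequence-to-structure window: slide a window of length w along x and
# pair each subsequence with the SS label at the window's middle position.

def window_instances(x, e, w=5):
    mid = w // 2  # special position: the middle of the window
    return [(x[i:i + w], e[i + mid]) for i in range(len(x) - w + 1)]

x = "ARNSTVVSTAA"
e = "HHHHLLLLEEE"  # assumed 3-state labels for the example
for sub, label in window_instances(x, e)[:3]:
    print(sub, "->", label)
# prints:
# ARNST -> H
# RNSTV -> H
# NSTVV -> L
```

Each (subsequence, label) pair is one learning instance; collecting them over all proteins in a dataset yields the classification table described on the following slides.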
Symbolic Notation
Let f be a protein secondary structure (PSS)
identification function:
f : A* → S*, i.e., f ⊆ A* × S*
 Let x = x1x2…xn, e = e1e2…en, f(x) = e; we define
f(x1x2…xn)|{xi} = ei, i.e., f(x)|{xi} = ei
Example:Semantics of Instances
Let
• x = A R N S T V V S T A A ….
• e = H H H H L L L E E E
and assume that the window returns the instance:
A R N S T → H
• The semantics of the instance is:
f(x)|{N} = H,
where f is the identification function, N is preceded by
A R and followed by S T, and the window has length 5
Classification Data Base (Table)
 We build the classification table with attributes
being the positions p1, p2, p3, p4, p5, …, pw
in the window, where w is the length of the window.
The corresponding attribute values are the elements
of the subsequence at the given positions.
 The classification attribute is S, with values in the set
{H, E, L}, assigned by the window operation
(instance, rule).
 The classification table for our example (first few
records) is the following.
Classification Table (Example)
 x = A R N S T V V S T A A ….
 e = H H H H L L L E E E

p1 p2 p3 p4 p5 | S
A  R  N  S  T  | H
R  N  S  T  V  | H
N  S  T  V  V  | L
S  T  V  V  S  | L

The semantics of a record r = r(p1, p2, p3, p4, p5, S) is: f(x)|{Vp3} = VS, where Va denotes the value of the attribute a.
Size of Classification Datasets (Tables)
The window mechanism produces very large datasets. For example, a window of size 13 applied to the CB513 dataset of 513 protein subunits produces about 70,000 records (instances).
Window
 Window has the following parameters:
 PARAMETER 1: i ∈ N+, the starting point of the window as it moves along the sequence x = x1x2…xn. The value i = 1 means the window starts at x1; i = 5 means the window starts at x5.
 PARAMETER 2: w ∈ N+ denotes the size (length) of the window.
 For example, the PHD system of Rost and Sander (1994) uses two window sizes: 13 and 17.
Window
 PARAMETER 3: p ∈ {1, 2, …, w}, where p is the special position of the window that returns the classification attribute values from S = {H, E, L}, and w is the size (length) of the window.
 PSSP PROBLEM:
find the optimal size w and the optimal special position p for the best prediction accuracy
Window: Symbolic Definition
 Window Arguments: window parameters and
secondary structure (x,e)
 Window Value: (subsequence of x, element of e)
 OPERATION (sequence-to-structure window): W is a partial function
  W : N+ × N+ × {1, …, k} × (A* × S*) → A* × S
  W(i, k, p, (x, e)) = (xi x(i+1) … x(i+k−1), f(x)|{x(i+p)}),
 where (x, e) = (x1x2…xn, e1e2…en).
Neural network models
 a machine learning approach
 provide training sets of structures (e.g. α-helices, non-α-helices)
 nets are trained to recognize patterns in known secondary structures
 provide a test set (proteins with known structures)
 accuracy ~ 70–75%
Reasons for improved accuracy
 Align sequence with other related proteins of the
same protein family
 Find members that have a known structure
 If there are significant matches between structure and sequence, assign secondary structures to the corresponding residues
3 State Neural Network
Neural Network
Input Layer
 Most approaches set w = 17. Why?
 Based on evidence of statistical correlation with
secondary structure as far as 8 residues on either
side of the prediction point
 The input layer consists of:
  17 blocks, each representing a position of the window
  Each block has 21 units:
  The first 20 units represent the 20 amino acids
  One more provides a null input, used when the moving window overlaps the amino- or carboxyl-terminal end of the protein
Binary Encoding Scheme
 Example:
 Let w = 5, and let say we have the sequence:
A E G K Q….
 Then the input layer is:
    A C D E F G H I K L M N P Q R S T V W Y
 A: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 E: 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 G: 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 K: 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 Q: 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
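A sketch of this encoding in Python (assuming the standard alphabetical ordering of the one-letter codes; the 21st unit per block is the null input):

```python
AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids, one-letter codes

def encode_window(window):
    """One-hot encode a window of residues: 21 input units per position
    (20 amino-acid units plus one null unit for characters that are not
    amino acids, e.g. padding past either end of the chain)."""
    units = []
    for ch in window:
        block = [0] * 21
        block[AA.index(ch) if ch in AA else 20] = 1  # unit 21: null input
        units.extend(block)
    return units

vec = encode_window("AEGKQ")  # 5 positions x 21 units = 105 inputs
```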
Hidden Layer
 Represents the structure of the central amino acid
 Encoding scheme:
  One can use two units to represent it:
  (1,0) = H, (0,1) = E, (0,0) = L
  Some systems use three units:
  (1,0,0) = H, (0,1,0) = E, (0,0,1) = L
 For each connection, we can assign some
weight value.
 This weight value can be adjusted to best fit the
data (training)
Output Level
 Based on the hidden level and some function f,
calculate the output.
 Helix is assigned to any group of 4 or more contiguous
residues
 Having helix output values greater than sheet outputs and
greater than some threshold t
 Strand (E) is assigned to any group of two or more contiguous residues having sheet output values greater than helix outputs and greater than t
 Otherwise, assigned to L
 Note that t can be adjusted as well (training)
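These decision rules can be sketched as a post-processing step over the per-residue outputs (a sketch only; the function name and run-smoothing details are assumptions, and t = 0.5 is an arbitrary illustrative threshold):

```python
def assign_states(h_out, e_out, t=0.5):
    """Turn per-residue helix/sheet network outputs into H/E/L states:
    a residue is provisionally H if its helix output beats both the sheet
    output and threshold t, provisionally E if sheet beats helix and t;
    then runs of fewer than 4 H or fewer than 2 E are demoted to loop L."""
    raw = []
    for h, e in zip(h_out, e_out):
        if h > e and h > t:
            raw.append("H")
        elif e > h and e > t:
            raw.append("E")
        else:
            raw.append("L")
    out, i = [], 0
    while i < len(raw):                      # enforce minimum run lengths
        j = i
        while j < len(raw) and raw[j] == raw[i]:
            j += 1
        run = raw[i:j]
        if raw[i] == "H" and len(run) < 4:
            run = ["L"] * len(run)
        elif raw[i] == "E" and len(run) < 2:
            run = ["L"] * len(run)
        out.extend(run)
        i = j
    return "".join(out)
```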
How PHD works
Step 1. BLAST search with input sequence
Step 2. Perform multiple seq. alignment and
calculate aa frequencies for each position
Pos.  Protein  Aligned residues  Profile (aa frequencies, query included, gaps excluded)
 1    K        K-HK              K=0.75, H=0.25
 2    E        EDAE              E=0.6, D=0.2, A=0.2
 3    L        FFFF              L=0.2, F=0.8
 4    N        SAAS              N=0.2, S=0.4, A=0.4
 5    D        QKKQ              K=0.4, Q=0.4, D=0.2
 6    L        LLLL              L=1.0
 7    E        EEEE              E=1.0
 8    K        KEKK              K=0.8, E=0.2
 9    Y        FFYF              Y=0.4, F=0.6
10    N        DDND              D=0.6, N=0.4
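Computing one column of such a profile is straightforward (a sketch; here the query residue is counted together with the aligned residues and gaps are excluded, which reproduces the numbers above):

```python
from collections import Counter

def column_profile(column):
    """Frequency profile of one alignment column; gaps ('-') are excluded."""
    residues = [c for c in column if c != "-"]
    n = len(residues)
    return {aa: count / n for aa, count in Counter(residues).items()}

# Position 1 above: query K plus the aligned residues K, -, H, K
profile = column_profile("KK-HK")  # K: 0.75, H: 0.25
```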
How PHD works
Step 3. First Level: “Sequence to structure net”
Input: alignment profile, Output: units for H, E, L
Calculate the "occurrences", i.e. the likelihood of each residue being present in an α-helix, β-strand, or loop.
Example: the profile at one position (N=0.2, S=0.4, A=0.4) is fed into the net, which outputs H = 0.05, E = 0.18, L = 0.67.
How PHD works
Step 3. Second Level: “Structure to structure net”
Input: First Level values, Output: units for H, E, L
Window size = 17
Example output: H = 0.59, E = 0.09, L = 0.31
Step 4. Decision level
Prepare Data for PHD Neural Nets
Starting from a sequence of unknown structure (SEQUENCE), the following steps are required to finally feed evolutionary information into the PHD neural networks:
1. a database search for homologues (method: Blast),
2. a refined profile-based dynamic-programming alignment of the most likely homologues (method: MaxHom),
3. a decision for which proteins will be considered as homologues (a length-dependent cutoff for pairwise sequence identity),
4. a final refinement and extraction of the resulting multiple alignment.
Numbers 1-3 indicate the points where users of the PredictProtein service can interfere to improve prediction accuracy without changes made to the final prediction method PHD.
http://cubic.bioc.columbia.edu/papers/2000_rev_humana/paper.html
PHD Neural Network
Prediction Accuracy
Authors         Year  % accuracy  Method
Chou-Fasman     1974  50%         propensities of aa's in 2nd structures
Garnier         1978  62%         interactions between aa's
Levin           1993  69%         multiple seq. alignments (MSA)
Rost & Sander   1994  72%         neural networks + MSA
Where can I learn more?
Protein Structure Prediction Center
Biology and Biotechnology Research Program
Lawrence Livermore National Laboratory, Livermore, CA
http://predictioncenter.llnl.gov/Center.html
DSSP
Database of Secondary Structure Prediction
http://www.sander.ebi.ac.uk/dssp/
Computational Molecular Biology
Protein Structure: Tertiary Prediction via
Threading
Objective
 Study the problem of predicting the tertiary
structure of a given protein sequence
A Few Examples
(Figures: four example proteins, each shown as the actual structure next to the predicted structure.)
Two Comparative Modeling Approaches
 Homology modeling – identification of homologous proteins
through sequence alignment; structure prediction through
placing residues into “corresponding” positions of
homologous structure models
 Protein threading – make structure prediction through
identification of “good” sequence-structure fit
 We will focus on protein threading.
Why Does it Work?
 Observations:
 Many protein structures in the PDB are very similar
 Eg: many 4-helical bundles, globins… in the set of
solved structure
 Conjecture:
 There are only a limited number of “unique” protein
folds in nature
Threading Method
 General Idea:
 Try to determine the structure of a new sequence by finding its best "fit" to some fold in a library of structures
 Sequence-Structure Alignment Problem:
 Given a solved structure T for a sequence t1t2…tn
and a new sequence S = s1s2… sm, we need to find
the “best match” between S and T
What to Consider
 How to evaluate (score) a given alignment of S with a structure T?
 How to efficiently search over all possible
alignments?
Three Main Approaches
 Protein Sequence Alignment
 3D Profile Method
 Contact Potentials
Protein Sequence Alignment Method
 Align two sequences S and T
 If in the alignment, si aligns with tj, assign si to the
position pj in the structure
 Advantages:
 Simple
 Disadvantages:
 Similar structures have lots of sequence variability, thus
sequence alignment may not be very helpful
3D Profile Method
 Actually uses structural information
 Main idea:
 Reduce the 3D structure to a 1D string describing
the environment of each position in the protein.
(called the 3D profile (of the fold))
 To determine if a new sequence S belongs to a
given fold T, we align the sequence with the fold’s
3D profile
 First question: How to create the 3D profile?
Create the 3D Profile
 For a given fold, do:
1. For each residue, determine:
 How buried is it?
 Fraction of surrounding environment that is polar
 What secondary structure is it in (alpha-helix, beta-sheet, or neither)?
Create the 3D profile
2. Assign an environment class to each position:
Six classes describe the burial and polarity criteria
(exposed, partially buried, very buried, different
fractions of polar environment)
Create the 3D Profile
 These environment classes depend on the
number of surrounding polar residues and how
buried the position is.
 There are 3 secondary-structure states for each of these, giving 18 environment classes
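A toy version of the class assignment (the thresholds and class names here are illustrative assumptions, not the published boundaries; only the 6 x 3 = 18 structure of the classes follows the text above):

```python
def environment_class(burial, polar_fraction, ss):
    """Map a residue's burial (0..1), polar fraction of its environment
    (0..1), and secondary structure ('H', 'E', 'L') to one of 18
    environment classes: 6 burial/polarity categories x 3 SS states.
    All numeric thresholds below are illustrative placeholders."""
    if ss not in ("H", "E", "L"):
        raise ValueError("ss must be H, E, or L")
    if burial < 0.4:
        env = "exposed"
    elif burial < 0.7:
        env = "partial-lo" if polar_fraction < 0.5 else "partial-hi"
    elif polar_fraction < 0.45:
        env = "buried-lo"
    elif polar_fraction < 0.58:
        env = "buried-med"
    else:
        env = "buried-hi"
    return env + "/" + ss
```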
Create the 3D Profile
3. Convert the known structure T to a string E of environment descriptors.
4. Align the new sequence S with E using dynamic
programming
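Step 4 is ordinary global dynamic programming, with the substitution score looked up by (amino acid, environment class) rather than (amino acid, amino acid). A minimal sketch (the score table and gap penalty below are illustrative assumptions):

```python
def align_to_profile(seq, env, score, gap=-2):
    """Needleman-Wunsch-style global alignment of sequence `seq` against
    an environment string `env`; score[(aa, env_class)] is the match
    score. Returns the optimal alignment score."""
    n, m = len(seq), len(env)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score[(seq[i - 1], env[j - 1])],
                          F[i - 1][j] + gap,   # gap in the environment string
                          F[i][j - 1] + gap)   # gap in the sequence
    return F[n][m]

# Toy example: 'e' = exposed, 'b' = buried (hypothetical classes and scores)
toy = {("A", "e"): 1, ("A", "b"): -1, ("L", "e"): -1, ("L", "b"): 2}
best = align_to_profile("AL", "eb", toy)  # A->e (+1) and L->b (+2) give 3
```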
Scores for Alignment
 Need scores for aligning individual residues
with environments.
 Key: different amino acids prefer different environments. Thus, determine the scores by looking at the statistical data.
Scores for Alignment
1. Choose a database of known structures
2. Tabulate the number of times we see a particular residue in a particular environment class -> compute the score for each (environment class, aa) pair
3. Choose gap penalties; e.g., one may charge more for gaps in alpha and beta environments
Alignment
 This gives us a table of scores for aligning an aa
sequence with an environment string
 Using this scoring and Dynamic Programming, we can
find an optimal alignment and score for each fold in our
library
 The fold with the highest score is the best fold for the
new sequence
Contact Potentials Method
 Take 3D structure into account more carefully
 Include information about how residues interact with
each other
 Consider pairwise interactions between the position pi, pj in
the fold
 For a given alignment, produce a score which is the sum over these pairwise interactions.
Problem
 Have a sequence from the database T = t1…tn with
known positions p1…pn, and a new sequence S =
s1…sm.
 Find 1 <= r1 < r2 < … < rn <= m which maximize the sum of the pairwise interaction scores over all interacting position pairs (pi, pj), where ri is the index of the aa in S which occupies position pi
 This problem is NP-complete for pairwise interactions
How to Define that Score?
 Use so-called "knowledge-based potentials", which come from databases of observed interactions.
 The general form is a log-odds score: -log(observed contact frequency / frequency expected under a random model).
How to Define the Score
 General Idea:
 Define cutoff parameter for “contact” (e.g. up to 6
Angstroms)
 Use the PDB to count up the number of times aa i
and j are in contact
 There are several methods for normalization; e.g., normalization by hypothetical random frequencies
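A toy version of building such a potential (a sketch under stated assumptions: structures are passed in as (residue, coordinate) lists rather than parsed PDB files, and the expected frequency uses the simplest possible random-pairing model):

```python
import math
from itertools import combinations

def contact_potential(structures, cutoff=6.0):
    """Knowledge-based pair potential: count residue pairs within `cutoff`
    Angstroms across a structure set, then score each residue-type pair as
    -log(observed / expected), so contacts seen more often than chance get
    favorable (negative) values. `structures` is a list of chains, each a
    list of (aa, (x, y, z)) tuples."""
    pair_counts, single_counts, total = {}, {}, 0
    for chain in structures:
        for (a, pa), (b, pb) in combinations(chain, 2):
            if math.dist(pa, pb) <= cutoff:
                key = tuple(sorted((a, b)))
                pair_counts[key] = pair_counts.get(key, 0) + 1
                single_counts[a] = single_counts.get(a, 0) + 1
                single_counts[b] = single_counts.get(b, 0) + 1
                total += 1
    n = sum(single_counts.values())
    scores = {}
    for (a, b), obs in pair_counts.items():
        expected = total * (single_counts[a] / n) * (single_counts[b] / n)
        if a != b:
            expected *= 2  # an unordered pair (a, b) can occur two ways
        scores[(a, b)] = -math.log(obs / expected)
    return scores
```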
Other Variations
 Many other variations in defining the potentials
 In addition to pairwise potentials, consider
single residue potentials
 Distance-dependent intervals:
 Counting up pairwise contacts separately for
intervals within 1 Angstrom, between 1 and 2
Angstroms…
Threading via Tree-Decomposition
Contact Graph
(Figure: a template structure and its contact graph.)
1. Each residue is a vertex.
2. There is one edge between two residues if their spatial distance is within a given cutoff.
3. Cores are the most conserved segments in the template.
Simplified Contact Graph
Alignment Example
Alignment Example
Calculation of Alignment Score
Graph Labeling Problem
(Figure: an interaction graph G over cores a–m.)
 Each core is a vertex.
 Two cores interact if there is an interaction between any two residues, one in each core.
 Add one edge between two cores that interact.
Each possible sequence alignment position for a single core can be treated as a possible label assignment to a vertex in G. Let D[i] be the set of all possible label assignments to vertex i. Then for each label assignment A(i) in D[i], we have:
Tree Decomposition
[Robertson & Seymour, 1986]
Greedy: minimum degree heuristic
(Figure: the heuristic applied step by step to the example graph on vertices a–m.)
1. Choose the vertex with minimum degree.
2. The chosen vertex and its neighbors form a component.
3. Add one edge between any two neighbors of the chosen vertex.
4. Remove the chosen vertex.
5. Repeat the above steps until the graph is empty.
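The elimination loop can be sketched directly from these steps (a sketch; it returns the bags, one per eliminated vertex, from which the decomposition tree is then assembled):

```python
def min_degree_bags(adj):
    """Greedy minimum-degree elimination: repeatedly pick the vertex of
    smallest degree, form a bag of it plus its neighbors, connect all
    pairs of its neighbors (fill-in edges), and remove the vertex.
    `adj` maps each vertex to the set of its neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    bags = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))  # step 1: minimum degree
        nbrs = adj[v]
        bags.append({v} | nbrs)                  # step 2: the component
        for a in nbrs:                           # step 3: fill-in edges
            for b in nbrs:
                if a != b:
                    adj[a].add(b)
        for a in nbrs:                           # step 4: remove the vertex
            adj[a].discard(v)
        del adj[v]
    return bags                                  # step 5: loop until empty
```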
Tree Decomposition (Cont’d)
(Figure: the example graph on vertices a–m and its tree decomposition, with bags such as abd, acd, cdem, defm, clk, eij, and fgh; labels on tree edges, e.g. ab, ac, c, f, ij, mark vertices shared between adjacent bags, and "remove dem" marks a simplification step.)
Tree Decomposition-Based Algorithms
(Figure: a tree decomposition rooted at Xr; Xi is an internal node with parent toward Xr and children Xj and Xl.)
1. Bottom-to-top: calculate the minimal F function.
2. Top-to-bottom: extract the optimal assignment.

F(Xi, A(Xir)) = min over A(Xi − Xr) of [ F(Xj, A(Xji)) + F(Xl, A(Xli)) + Score(Xi, A(Xi)) ]

Here F(Xi, ·) is the score of the subtree rooted at Xi, F(Xj, ·) and F(Xl, ·) are the scores of the subtrees rooted at Xj and Xl, Score(Xi, A(Xi)) is the score of component Xi, and Xir denotes the vertices Xi shares with its neighbor toward the root Xr.
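The bottom-to-top pass can be sketched as a recursion over the rooted decomposition (a sketch only: it enumerates the label assignments within each bag exhaustively, and the data layout — bags, children, per-vertex label domains, and a bag-scoring function — is an assumed representation):

```python
from itertools import product

def tree_dp(bags, children, root, domain, score):
    """Bottom-to-top DP over a rooted tree decomposition.

    F(i, fixed) is the minimum, over assignments to bag i that agree with
    the labels `fixed` shared with the parent, of the bag's own score plus
    the optimal scores of the subtrees rooted at i's children."""
    def F(i, fixed):
        free = sorted(bags[i] - set(fixed))      # vertices not yet labeled
        best = float("inf")
        for labels in product(*(domain[v] for v in free)):
            asgn = dict(fixed, **dict(zip(free, labels)))
            total = score(bags[i], asgn)         # score of component Xi
            for c in children.get(i, []):
                shared = {v: asgn[v] for v in bags[c] & bags[i]}
                total += F(c, shared)            # score of child's subtree
            best = min(best, total)
        return best
    return F(root, {})
```

A top-to-bottom pass that remembers the minimizing assignments then recovers the optimal threading.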