lecture08_08


Structural Bioinformatics
Proteins
Structure Prediction Motivation
• Understand protein function
– Locate binding sites
• Broaden homology
– Detect similar function where sequence differs
(only ~50% of remote homologies can be detected
based on sequence)
• Explain disease
– See effect of amino acid changes
– Design suitable compensatory drugs
2
Myoglobin – the first high resolution protein structure
Solved in 1958 by John Kendrew and colleagues at Cambridge University.
Kendrew shared the 1962 Nobel Prize in Chemistry with Max Perutz.
“ Perhaps the most remarkable features of the molecule are its
complexity and its lack of symmetry. The arrangement seems to
be almost totally lacking in the kind of regularities which one
instinctively anticipates.”
3
From the structure we can get information
about the secondary and tertiary structure of
the protein
What are Secondary Structures ??
4
Secondary Structure
Secondary structure is usually divided into
three categories:
Alpha helix
Beta strand (sheet)
Anything else –
turn/loop
5
Alpha Helix: Pauling (1951)
• A consecutive stretch of 5-40 amino
acids (average 10).
• A right-handed spiral conformation.
• 3.6 amino acids per turn.
(one turn = 3.6 residues, pitch 5.4 Å)
• Stabilized by H-bonds
6
Beta Strand: Pauling and Corey (1951)
• Different polypeptide chains run alongside each
other and are linked together by hydrogen bonds.
• Each section is called a β-strand
and consists of 5-10 amino acids.
7
Beta Sheet
The strands become
adjacent to each other,
forming a beta-sheet.
[Figure: antiparallel and parallel beta-sheets, with labeled distances of 3.47 Å, 3.25 Å and 4.6 Å]
8
Loops
• Connect the secondary
structure elements.
• Vary in length and shape.
• Located at the surface of
the folded protein, and
therefore may play an
important role in
biological recognition
processes.
9
Tertiary Structure
Describes the packing of alpha-helices, beta-sheets
and random coils with respect to each other on the
level of one whole polypeptide chain
10
How does the structure relate to the
primary protein sequence??
11
SEQUENCE
STRUCTURE
FUNCTION
Each protein has a particular 3D structure that
determines its function
Early experiments have shown that the sequence of the
protein is sufficient to determine its structure
Protein structure is more conserved than protein
sequence, and more closely related to function.
Homologous proteins are of the same evolutionary origin.
Despite the differences which have been accumulated in
their sequences, the structure and function of these proteins
can be remarkably conserved.
12
How Can Different Amino Acid
Sequences Determine Similar Protein
Structures?
Lesk and Chothia 1980
13
The Globin Family
14
Different sequences can result in similar structures
1ecd
2hhd
15
We can learn about the important features
that determine structure and function by
comparing sequences and structures.
16
The Globin Family
17
Why is Proline 36 conserved in all the globin family ?
18
Where are the gaps??
The gaps in the pairwise alignment are mapped to the loop regions
19
How are remote homologs related in terms of their structure?
[Figure: structures of retinol-binding protein (RBP), apolipoprotein D,
β-lactoglobulin and odorant-binding protein]
20
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query: 3   WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
            V  L+ LA  A              +  S  V+ENFD  ++ G WY + K
Sbjct: 1   MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55  DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
           + I A +S+ E G +    K          V  +     ++  +PAK +++++ +
Sbjct: 60  NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
                +WI+ TDY+ YA+ YSC            + ++  R+P  LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
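The summary line of the alignment (identities and gaps over 170 columns) can be recomputed from the aligned rows themselves. The sketch below is only an illustration: the function name `alignment_stats` is made up here, and it does not reproduce the "Positives" count, which would additionally need the substitution matrix used by PSI-BLAST.

```python
# Aligned rows of the three blocks above, concatenated ('-' marks a gap).
QUERY = (
    "WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ"
    "DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ"
    "KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA"
)
SBJCT = (
    "MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG"
    "NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL-----"
    "MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET"
)


def alignment_stats(query, sbjct):
    """Return (alignment length, identical columns, gapped columns)."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(query, sbjct))
    identities = sum(q == s and q != "-" for q, s in columns)
    gaps = sum(q == "-" or s == "-" for q, s in columns)
    return len(columns), identities, gaps


length, identities, gaps = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({identities / length:.0%})")
print(f"Gaps = {gaps}/{length} ({gaps / length:.0%})")
```

With only about a quarter of the columns identical, the relationship between the two proteins is hard to see from the raw sequences alone, which is the point of the slide.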
21
The Retinol Binding Protein
β-lactoglobulin
22
So how can we obtain the structure
information ???
23
PDB: Protein Data Bank
• Database of molecular structures:
proteins and nucleic acids (DNA and RNA)
• Structures solved by
X-ray crystallography
NMR
Electron microscopy
24
RCSB PDB – Protein Data Bank
http://www.rcsb.org/pdb/
25
How Many Structures ?
March 2008 – 49295 Structures
26
Structure Prediction: Motivation
• Hundreds of thousands of gene sequences
translated to proteins (GenBank, SwissProt, PIR)
• Only ~40000 solved protein structures
• Experimental methods are time consuming and
not always possible
• Goal: Predict protein structure based
on sequence information
Prediction Approaches
• Primary (sequence) to secondary structure
– Sequence characteristics
• Secondary to tertiary structure
– Fold recognition
– Threading against known structures
• Primary to tertiary structure
– Ab initio modelling
28
Secondary Structure Prediction
• Given a primary sequence
ADSGHYRFASGFTYKKMNCTEAA
what secondary structure will it adopt ?
29
RBP (Retinol Binding Protein)
Globin
30
According to the most simplified model:
• In a first step, the secondary structure is
predicted based on the sequence.
• The secondary structure elements are then
arranged to produce the tertiary structure,
i.e. the structure of a protein chain.
• For molecules which are composed of
different subunits, the protein chains are
arranged to form the quaternary structure.
31
Secondary Structure Prediction
Methods
• Chou-Fasman / GOR Method
– Based on amino acid frequencies
• Machine learning methods
– PHDsec and PSIpred
• HMM (Hidden Markov Model)
• Best accuracy nowadays ~80%
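The slide does not say how accuracy is measured. A minimal sketch, assuming the ~80% figure refers to per-residue three-state accuracy (often called Q3): the fraction of residues whose predicted state (H, E or C) matches the observed one. The function name and the two example strings are hypothetical.

```python
def q3_accuracy(predicted, observed):
    """Fraction of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not observed:
        raise ValueError("strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical example: H = helix, E = strand, C = coil.
observed = "CCHHHHHHHHCCEEEEECC"
predicted = "CCHHHHHHCCCCEEEECCC"
print(f"Q3 = {q3_accuracy(predicted, observed):.1%}")
```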
32
Chou and Fasman (1974)
The propensity of an amino
acid to be part of a certain
secondary structure (e.g.
proline has a low
propensity for being in an
alpha helix or beta sheet,
so it acts as a breaker)
Name             P(a)   P(b)   P(turn)
Alanine           142     83      66
Arginine           98     93      95
Aspartic Acid     101     54     146
Asparagine         67     89     156
Cysteine           70    119     119
Glutamic Acid     151     37      74
Glutamine         111    110      98
Glycine            57     75     156
Histidine         100     87      95
Isoleucine        108    160      47
Leucine           121    130      59
Lysine            114     74     101
Methionine        145    105      60
Phenylalanine     113    138      60
Proline            57     55     152
Serine             77     75     143
Threonine          83    119      96
Tryptophan        108    137      96
Tyrosine           69    147     114
Valine            106    170      50
Success rate of 50%
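To make the table concrete, the sketch below averages the three propensities over a short peptide and reports which state scores highest. This is an illustration of how the numbers can be read, not the full Chou-Fasman procedure (which also has nucleation and extension rules); the one-letter keys and the function names are added here for convenience.

```python
# (P(a), P(b), P(turn)) from the table above, keyed by one-letter code.
PROPENSITY = {
    "A": (142, 83, 66), "R": (98, 93, 95), "D": (101, 54, 146),
    "N": (67, 89, 156), "C": (70, 119, 119), "E": (151, 37, 74),
    "Q": (111, 110, 98), "G": (57, 75, 156), "H": (100, 87, 95),
    "I": (108, 160, 47), "L": (121, 130, 59), "K": (114, 74, 101),
    "M": (145, 105, 60), "F": (113, 138, 60), "P": (57, 55, 152),
    "S": (77, 75, 143), "T": (83, 119, 96), "W": (108, 137, 96),
    "Y": (69, 147, 114), "V": (106, 170, 50),
}
STATES = ("helix", "strand", "turn")


def mean_propensities(peptide):
    """Average P(a), P(b) and P(turn) over all residues of the peptide."""
    rows = [PROPENSITY[aa] for aa in peptide]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


def favoured_state(peptide):
    """Name of the state with the highest average propensity."""
    means = mean_propensities(peptide)
    return STATES[means.index(max(means))]


for peptide in ("AEM", "VIY", "GNP"):
    print(peptide, mean_propensities(peptide), favoured_state(peptide))
```

Values above 100 mean the residue is found in that state more often than average, so a stretch rich in Ala, Glu and Met leans towards helix, while Gly, Asn and Pro lean towards turns.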
33
Secondary Structure Method
Improvements
‘Sliding window’ approach
• Most alpha helices are ~12 residues long
• Most beta strands are ~6 residues long
→ Look at all windows of size 6/12
→ Calculate a score for each window. If the score > threshold
→ predict this is an alpha helix/beta sheet
TGTAGPOLKCHIQWMLPLKK
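The sliding-window idea can be sketched as follows. Several choices here are assumptions rather than part of the slide: the window score is taken to be the mean Chou-Fasman propensity, the threshold is set to 100, and letters missing from the table (such as the O in the example sequence) are scored as a neutral 100.

```python
# P(a) and P(b) columns of the Chou-Fasman table, keyed by one-letter code.
HELIX = {"A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
         "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
         "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106}
STRAND = {"A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
          "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
          "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}


def window_scores(seq, table, size, default=100):
    """Mean propensity of every window of the given size along seq."""
    return [
        sum(table.get(aa, default) for aa in seq[i:i + size]) / size
        for i in range(len(seq) - size + 1)
    ]


def mark_windows(seq, table, size, label, threshold=100):
    """Label every residue covered by at least one window above threshold."""
    marks = ["-"] * len(seq)
    for start, score in enumerate(window_scores(seq, table, size)):
        if score > threshold:
            for pos in range(start, start + size):
                marks[pos] = label
    return "".join(marks)


seq = "TGTAGPOLKCHIQWMLPLKK"
print(seq)
print(mark_windows(seq, HELIX, 12, "H"))   # helix windows of 12 residues
print(mark_windows(seq, STRAND, 6, "E"))   # strand windows of 6 residues
```

Where the helix and strand tracks overlap, a real method needs a tie-breaking rule; this sketch simply reports both tracks.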
34
Improvements since the 1980s
• Adding information from conservation in
MSA
• Smarter algorithms (e.g. HMM, neural
networks).
Success -> 75%-80%
35
PHDsec and PSIpred
• PHDsec
– Rost & Sander, 1993
– Based on sequence family alignments (MaxHom)
• PSIpred
– Jones, 1999
– Based on Position Specific Scoring Matrix
Generated by PSI-BLAST
• Both consider long-range interactions
36
How does secondary structure prediction
work?
Step 1:
Generating a multiple sequence
alignment
[Figure: the query is compared against SwissProt; the matching subject sequences are aligned with the query]
37
Steps in secondary structure prediction:
Step 2:
Additional sequences are added using a
profile:
– A PSI-BLAST PSSM.
– A conservation profile (MaxHom).
We end up with an MSA which represents the
protein family.
[Figure: query → seed → MSA of the query and subject sequences]
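As a rough picture of what a profile holds, the sketch below turns a small MSA into per-column amino acid frequencies. It is a simplification: a real PSI-BLAST PSSM stores log-odds scores with sequence weighting and pseudocounts, none of which is modeled here, and the toy alignment is invented for the example.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(msa):
    """Per-column amino acid frequencies of an MSA; gap characters are ignored."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(aa for aa in column if aa in ALPHABET)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Hypothetical toy alignment: one query row followed by three subject rows.
msa = [
    "MKV-LA",
    "MRVELA",
    "MKIELG",
    "MKV-LA",
]
for position, frequencies in enumerate(column_profile(msa), start=1):
    print(position, frequencies)
```

Fully conserved columns (here the first M and the L) end up with a single frequency of 1.0, while variable columns spread their weight over several residues.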
38
Steps in secondary structure prediction:
The sequence profile of the protein family
is compared (by machine learning
methods) to sequences with known
secondary structure.
[Figure: query → seed → MSA of the query and subject sequences → machine learning approach ← known structures]
39
SS prediction using Neural Network
[Figure: neural network input, a sequence profile with one row per symbol: A C D E F G H I K L M N P Q R S T V W Y .]
40
PHDsec Neural Net
[Figure: input layer with one unit per symbol (A C D E F G H I K L M N P Q R S T V W Y .), a hidden layer (known ss) and an output layer]
Output prediction: H = helix, E = strand, C = coil
Confidence: 0 = low, 9 = high
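The input side of such a network can be sketched as a window of residues around each position, with every residue encoded over the 21 symbols shown in the figure. Several points are assumptions: a single sequence is one-hot encoded instead of using a full profile, the window size of 13 is a parameter chosen for illustration, and '.' is used as padding beyond the sequence ends (the slide does not say what '.' stands for).

```python
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY."  # 20 amino acids plus '.', as in the figure


def one_hot(symbol):
    """21-element vector with a 1.0 at the position of the symbol."""
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]


def encode_windows(seq, size=13):
    """One input vector per residue: the one-hot codes of its window, concatenated."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so it has a central residue")
    half = size // 2
    padded = "." * half + seq + "." * half
    vectors = []
    for centre in range(len(seq)):
        vector = []
        for symbol in padded[centre:centre + size]:
            vector.extend(one_hot(symbol))
        vectors.append(vector)
    return vectors


inputs = encode_windows("ADSGHYRFASGFTYKKMNCTEAA")
print(len(inputs), "input vectors of length", len(inputs[0]))
```

Each vector would then be fed to the network, which outputs one of H, E or C for the residue at the centre of the window.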
41
HMM
• HMM enables us to calculate the
probability of assigning a sequence of
hidden states to the observation
Observation: TGTAGPOLKCHIQWML
Hidden states (known ss): HHHHHHHLLLLBBBBB
p = ?
42
[Figure: table of HMM probabilities, with these entries annotated:]
• Beginning with an α-helix
• α-helix followed by α-helix
• The probability of observing alanine as part of a β-sheet
• The probability of observing a residue which belongs to an
α-helix followed by a residue belonging to a turn = 0.15
Table built according to a large database of known secondary
structures
43
HMM
• The above table enables us to calculate the
probability of assigning secondary structure
to a protein
• Example
TGQ
HHH
p = 0.45 x 0.041 x 0.8 x 0.028 x 0.8 x 0.0635 ≈
0.000020995
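The product in the example can be written as a short function. The reading of the six factors is an assumption, since the table itself is not reproduced in the transcript: 0.45 is taken as the probability of beginning with a helix, 0.8 as the helix-to-helix transition probability, and 0.041, 0.028 and 0.0635 as the probabilities of observing T, G and Q in a helix.

```python
# Only the entries needed for the TGQ / HHH example; values as read from the slide.
START = {"H": 0.45}
TRANSITION = {("H", "H"): 0.8}
EMISSION = {("H", "T"): 0.041, ("H", "G"): 0.028, ("H", "Q"): 0.0635}


def path_probability(observation, states):
    """Joint probability of an observed sequence and one hidden-state path."""
    if len(observation) != len(states) or not observation:
        raise ValueError("need two non-empty strings of equal length")
    p = START[states[0]] * EMISSION[(states[0], observation[0])]
    for i in range(1, len(observation)):
        p *= TRANSITION[(states[i - 1], states[i])]
        p *= EMISSION[(states[i], observation[i])]
    return p


print(path_probability("TGQ", "HHH"))
```

Multiplying the six factors gives roughly 2.1 x 10^-5. Probabilities of this kind shrink quickly with sequence length, so practical implementations usually sum log-probabilities instead of multiplying raw values.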
44
Secondary structure prediction
• AGADIR - An algorithm to predict the helical content of peptides
• APSSP - Advanced Protein Secondary Structure Prediction Server
• GOR - Garnier et al, 1996
• HNN - Hierarchical Neural Network method (Guermeur, 1997)
• Jpred - A consensus method for protein secondary structure prediction at University of Dundee
• JUFO - Protein secondary structure prediction from sequence (neural network)
• nnPredict - University of California at San Francisco (UCSF)
• PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
• Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
• PSA - BioMolecular Engineering Research Center (BMERC) / Boston
• PSIpred - Various protein structure prediction methods at Brunel University
• SOPMA - Geourjon and Deléage, 1995
• SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
• DLP - Domain linker prediction at RIKEN
45