
Basics of protein structure and modeling
Rui Alves
Proteins are the primary functional
manifestation of genomes
DNA sequence
atgcaaactctttctgaacgcctcaagaagaggcgaattgcgttaaaaatgacgcaaaccgaa
ctggcaaccaaagccggtgttaaacagcaatcaattcaactgattgaagctggagtaaccaa
gcgaccgcgcttcttgtttgagattgctatggcgcttaactgtgatccggtttggttacagtacgg
aactaaacgcggtaaagccgcttaa
transcription
RNA sequence
augcaaacucuuucugaacgccucaagaagaggcgaauugcguuaaaaaugacgcaaacc
gaacuggcaaccaaagccgguguuaaacagcaaucaauucaacugauugaagcuggagua
accaagcgaccgcgcuucuuguuugagauugcuauggcgcuuaacugugauccgguuug
guuacaguacggaacuaaacgcgguaaagccgcuuaa
translation
protein sequence
MQTLSERLKKRRIALKMTQTELATKAGVKQQSIQLIEAGVT
KRPRFLFEIAMALNCDPVWLQYGTKRGKAA
protein structure
Protein function
Being able to predict the protein sequence from the gene sequence allows us to
predict structure, which in turn helps us understand how the protein does what it
does
Outline
• DNA sequence to protein sequence
• From protein sequence to secondary structure
• Protein tertiary structure
• Predicting protein structure
Predicting protein sequence from
DNA sequence
• Protein sequence can be predicted by
translating the cDNA and using the genetic
code.
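A minimal sketch of this translation step, assuming the standard (universal) genetic code and reading frame 1. The names `translate` and `CODON_TABLE` are illustrative and not part of the original slides; `*` marks a stop codon (Ter).

```python
BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard genetic code); '*' = stop.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: STANDARD_AAS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cdna):
    """Translate a cDNA (or mRNA) string in frame 1, stopping after the first stop codon."""
    seq = cdna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[pos:pos + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


print(translate("ATGTCTCTTATATGA"))  # MSLI*
```

The same function accepts the RNA form of a sequence, because `U` is mapped back to `T` before lookup.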
Translating cDNA to protein
ATGTCTCTTATATGA…
Met Ser Leu Ile Ter
No Gene!!!!! (the reading frame reaches a stop codon after only four residues)
Translating cDNA to Protein
Translating yeast mitochondrial
cDNA into protein sequence
ATGTCTCTTATATGA………SECIS sequence
Universal genetic code: Met Ser Leu Ile Ter
Actual translation: Met Ser Thr Met sCys
There is a Gene with a considerably different
protein sequence from the one we would
predict from the
universal genetic code!!!!!
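To make the difference between genetic codes concrete, the sketch below compares the standard code with the yeast mitochondrial code (NCBI translation tables 1 and 3, written as 64-letter strings in TCAG codon order). The function names are illustrative. One caveat: under table 3 the final TGA reads as Trp (W); the selenocysteine (sCys) reading shown on the slide depends on the SECIS element and is not modeled here.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
# NCBI translation table 1 (standard) and table 3 (yeast mitochondrial); '*' = stop.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
YEAST_MITO = "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate_with(cdna, aas):
    """Translate every complete codon of cdna using the given 64-letter code string."""
    table = dict(zip(CODONS, aas))
    return "".join(table[cdna[i:i + 3]] for i in range(0, len(cdna) - 2, 3))


def reassigned_codons(aas_a, aas_b):
    """Return {codon: (meaning_in_a, meaning_in_b)} for codons that differ."""
    return {c: (x, y) for c, x, y in zip(CODONS, aas_a, aas_b) if x != y}


seq = "ATGTCTCTTATATGA"
print(translate_with(seq, STANDARD))    # MSLI*
print(translate_with(seq, YEAST_MITO))  # MSTMW
print(sorted(reassigned_codons(STANDARD, YEAST_MITO)))
```

Only six codons differ between the two tables (the four CTN codons, ATA and TGA), yet three of the five codons in this short example change meaning.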
Amino acids are the primary building
blocks of proteins
The sequence of AAs is the primary structure of proteins
Sequence determines structure
Amino acids don’t fall neatly into classes
How we casually speak of them can affect the way we
think about their behavior. For example, if you think of
Cys as a polar residue, you might be surprised to find it in
the hydrophobic core of a protein unpaired to any other
polar group. But this does happen.
• The properties of a residue type can also vary with
conditions/environment
Grouping the amino acids by properties
Livingstone & Barton, CABIOS, 9, 745-756, 1993.
Proteins are made by controlled
polymerization of amino acids
Two amino acids condense to form a dipeptide: water is eliminated and a peptide bond is formed.
H2N-CH(R1)-CO-OH + H2N-CH(R2)-CO-OH → H2N-CH(R1)-CO-NH-CH(R2)-CO-OH + HOH
Residue 1 carries the N or amino terminus; residue 2 carries the C or carboxy terminus.
If there are more it becomes a polypeptide. Short polypeptide chains are usually called peptides while longer ones are called proteins.
Repeating torsion angles
φ/ψ angles characterize the secondary structure
Secondary structure elements in proteins
Reflect the tendency of backbone to hydrogen bond with itself in a semi-ordered fashion when compacted
A secondary structure element is a contiguous region of a protein sequence characterized by a repeating pattern of main-chain hydrogen bonds and backbone phi/psi angles
• alpha-helix (local interactions)
• beta-strand (nonlocal interactions)
Principal types of secondary structure
found in proteins
Repeating (φ,ψ) values
                                  φ        ψ
α-helix (1-5) (right-handed)     -63°     -42°
3-10 helix (1-4)                 -57°     -30°
Parallel β-sheet                -119°    +113°
Antiparallel β-sheet            -139°    +135°
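As an illustration of how these repeating values can be used, the sketch below assigns a (φ,ψ) pair to the nearest of the four canonical points listed above. This is a toy nearest-point rule chosen for illustration; it is not how DSSP assigns secondary structure, since DSSP relies on hydrogen-bond patterns. The function names are hypothetical.

```python
import math

# Canonical (phi, psi) values in degrees, taken from the table above.
CANONICAL = {
    "alpha-helix": (-63.0, -42.0),
    "3-10 helix": (-57.0, -30.0),
    "parallel beta-sheet": (-119.0, 113.0),
    "antiparallel beta-sheet": (-139.0, 135.0),
}


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (wraps at +/-180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_type(phi, psi):
    """Name of the canonical secondary structure closest to (phi, psi)."""
    def dist(name):
        p, s = CANONICAL[name]
        return math.hypot(angle_diff(phi, p), angle_diff(psi, s))
    return min(CANONICAL, key=dist)


print(nearest_type(-60.0, -45.0))   # alpha-helix
print(nearest_type(-120.0, 120.0))  # parallel beta-sheet
```

Because a single residue's angles can sit near a canonical point by chance, a real assignment would also require the pattern to repeat over several consecutive residues.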
The alpha-helix: repeating i,i+4 h-bonds
[Figure: α-helix with residues 1-12 numbered and a hydrogen bond marked, beside a φ/ψ plot (both axes -180° to 180°) showing the right-handed helical region of phi-psi space; α-helix (1-5, right-handed): φ = -63°, ψ = -42°]
By DSSP definitions, which of
residues 1-12 are in the helix? Does
this coincide with the residues in the
helical region of phi-psi space?
β strands/sheets
[Figure: β-sheet with residues 49-57 numbered, beside a φ/ψ plot (both axes -180° to 180°) showing the beta-strand region of phi-psi space; parallel β-sheet: φ = -119°, ψ = +113°]
Is this a parallel or anti-parallel sheet?
By DSSP definitions, which of res 49-57 are in the
sheet? Does this coincide with the residues in the
beta-strand region of phi-psi space?
Contact maps of protein structures
• Both axes are the sequence of the protein
• Map of Cα-Cα distances < 6 Å
• Near diagonal: local contacts in the sequence
• Off-diagonal: long-range (nonlocal) contacts
[Figure: contact map and rainbow ribbon diagram (blue to red: N to C) of 1avg, structure of triabin]
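A minimal sketch of how such a contact map could be computed from Cα coordinates, using the 6 Å cutoff from the slide. The coordinates below are a made-up six-residue hairpin, not taken from 1avg, and the sequence separation of 4 used to call a contact "long-range" is an assumption for illustration.

```python
import math


def contact_map(ca_coords, cutoff=6.0):
    """Boolean matrix: True where two C-alpha atoms are closer than cutoff (in angstroms)."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) < cutoff for j in range(n)]
            for i in range(n)]


def long_range_contacts(cmap, min_separation=4):
    """Pairs (i, j) in contact that are at least min_separation apart in sequence."""
    n = len(cmap)
    return [(i, j) for i in range(n)
            for j in range(i + min_separation, n) if cmap[i][j]]


# Hypothetical C-alpha trace: three residues out, a turn, three residues back.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
       (7.6, 4.0, 0.0), (3.8, 4.0, 0.0), (0.0, 4.0, 0.0)]
cmap = contact_map(toy)
print(long_range_contacts(cmap))  # [(0, 4), (0, 5), (1, 5)]
```

In the matrix, the pairs printed here are the off-diagonal entries; contacts between sequence neighbours fall next to the diagonal.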
What does secondary structure teach us?
• If secondary structure can be predicted from the primary structure, this may
help in predicting protein function, via
evolutionary relationships with known folds
Tertiary structure in proteins
• Single polypeptide chain
• The number and order of secondary structures in the
sequence (connectivity) and their arrangement in space
define a protein’s fold or topology
• The pattern of contacts between side chains/backbone is also an
aspect of tertiary structure
• Outer surface and interior
Obvious interactions in native protein
structures
• hydrophobic interactions (between side chains R1, R2, R3)
• disulfide crosslinks (S-S)
• polar interactions (hydrogen bond/salt bridge, e.g. between NH3 and CO2 groups)
The Protein Data Bank
The Protein Data Bank (PDB) is a central repository of protein
structures
http://www.rcsb.org/pdb/home/home.do
Major structure classification systems
SCOP (Structural Classification of Proteins)
CATH (Class-Architecture-Topology-Homology)
DALI/FSSP (Fold classification based on Structure-Structure Alignment)
SCOP and CATH are quite similar and generally combine automated and manual
aspects. They are both “curated” by human experts.
The nuts and bolts behind fold
prediction
• Start from a database of known structures (α-helix, coil, β-strand) and the database of corresponding sequences (ACDEFGTYAEE…)
• Split both databases into a training set and a test set
• From the training set, tabulate the probability of each amino acid being in each state: p(aa1-helix), p(aa1-coil), p(aa1-strand), …
[Table: p(α-helix), p(coil), p(β-strand) for single residues (A: 0.23, 0.28, 0.5) and for residue combinations (A…C…: 0.1…0.03, 0.04…0.002, 0.1…0.21)]
• Predict 2ary structure for the test set and compare with the known structures
• Good predictions: method ready for new sequence and structure predictions
• Bad predictions: reshuffle training set and test set and repeat until 2ndary structure predictions are correct
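The train/test procedure above can be sketched with a deliberately simple model: per-residue state probabilities estimated from a training set, then scored on a test sequence. The sequences and state strings below are invented for illustration (H = helix, E = strand, C = coil), and real predictors also use the neighbouring residues, which this sketch ignores.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Estimate p(state | amino acid) from (sequence, states) string pairs."""
    counts = defaultdict(Counter)
    for seq, states in pairs:
        for aa, st in zip(seq, states):
            counts[aa][st] += 1
    return {aa: {st: n / sum(c.values()) for st, n in c.items()}
            for aa, c in counts.items()}


def predict(seq, probs, default="C"):
    """Assign each residue its most probable state; unseen residues get the default."""
    out = []
    for aa in seq:
        p = probs.get(aa)
        out.append(max(p, key=p.get) if p else default)
    return "".join(out)


def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed one."""
    return sum(a == b for a, b in zip(predicted, observed)) / len(observed)


# Hypothetical training set: sequences with their known secondary structure.
training_set = [("AAELKV", "HHHHEE"), ("GPVIVA", "CCEEEH")]
probs = train(training_set)
prediction = predict("AVGA", probs)
print(prediction, q3(prediction, "HECC"))  # HECH 0.75
```

If the score on the test set is poor, the loop described on the slide would reshuffle which structures go into `training_set` and repeat.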
How does a fold prediction server work?
YOUR SEQUENCE → Server
The server draws on:
• Database of known structures
• Database of corresponding sequences
• Database of probabilities of aa in 2ndary structure
• Helix-coil-strand profile folds database
Strong homology → homology-based fold prediction
Weak/no homology → helix-coil-strand profile prediction → fold prediction
Predicting protein folding
Predicting protein structure
• Homology Modeling
– 3D-JIGSAW, SWISSMODEL
• Ab initio Modeling
– ROBETTA
Predicting protein structure by
homology
How does a homology modeling server
work?
…YDVRSEQVENCE… → Server/Program
• Search the database of sequences corresponding to the database of known structures
• Strong homologues found: build the best possible alignment
…YDVR-SEQVENCE…
…YDVRMSD-VDNCD…
• Thread sequence to predict over known structure according to alignment (Sequence + Structure)
• Optimization via energy minimization, etc…
Predicting protein structure
• Homology Modeling
– 3D-JIGSAW, SWISSMODEL
• Ab initio Modeling
– ROSETTA
Predicting protein structure by ab
initio methods
…YDVRSEQVENCE… → Server/Program
• Database of corresponding sequences: NO homologues
• Database of structures for smaller amino acid runs: match short runs of the sequence
…YDVR-SEQ        …VENCE…
…YDVRMSD-…       …YDNCD…
…YDVR-SEQ        …VENCE…
…YPVRMSD-…       …VEQCE…
…
• Assemble
• Energy minimization & optimization
Accuracy of modelling
• Accuracy varies widely.
• The quality of the model is VERY dependent on
the quality of the alignment
• Globular proteins are more accurately predicted
• Membrane proteins are still a big problem
• Homology modelling is “bad” if sequence identity to the template is <30%
• CASP is a biennial meeting where the accuracy of the
different methods is assessed
– The Baker group is usually more accurate
than others
http://www.predictioncenter.org/
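Because model quality depends so strongly on the alignment, a quick check is the percent identity between the aligned target and template. The sketch below assumes the 30% threshold refers to sequence identity and counts identity over columns where neither sequence has a gap; other conventions (for example dividing by the full alignment length) give different numbers. The function name is illustrative.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# The example alignment from the homology modeling slide.
identity = percent_identity("YDVR-SEQVENCE", "YDVRMSD-VDNCD")
print(round(identity, 1))  # 72.7
print("below 30% threshold" if identity < 30.0 else "above 30% threshold")
```

For the slide's toy alignment, 8 of the 11 gap-free columns match, which is well above the 30% mark.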
Summary
• DNA sequence to protein sequence
• From protein sequence to secondary structure
• Protein tertiary structure
• Predicting protein structure
“Accessible Surface”
• Represent atoms as spheres w/appropriate radii and eliminate overlapping parts...
• Mathematically roll a sphere all around that surface...
• The sphere’s center traces out a surface as it rolls...
Lee & Richards, 1971
Shrake & Rupley, 1973
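The rolling-sphere idea can be approximated numerically in the spirit of the Shrake & Rupley approach: place test points on each atom's sphere expanded by the probe radius, and count the points that do not fall inside any neighbouring expanded sphere. This is a minimal sketch, not a reference implementation; the 1.4 Å probe radius, the 200 test points and the golden-spiral point layout are assumptions chosen for illustration.

```python
import math


def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral layout)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(golden * k), r * math.sin(golden * k), z))
    return pts


def accessible_surface(atoms, probe=1.4, n_points=200):
    """atoms: list of (x, y, z, radius). Returns accessible area per atom in square angstroms."""
    unit = sphere_points(n_points)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe  # surface traced by the probe center
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            # A test point is buried if it lies inside any other expanded sphere.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms) if j != i
            )
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * big_r ** 2 * exposed / n_points)
    return areas


full = 4.0 * math.pi * (1.7 + 1.4) ** 2
lone = accessible_surface([(0.0, 0.0, 0.0, 1.7)])
pair = accessible_surface([(0.0, 0.0, 0.0, 1.7), (3.0, 0.0, 0.0, 1.7)])
print(lone[0] == full or math.isclose(lone[0], full), pair[0] < full)
```

An isolated atom keeps its full expanded-sphere area, while two atoms placed 3 Å apart each lose the part of the surface that faces the other atom.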
The outer surface: water in protein
structures
Structures of water-soluble proteins determined at reasonably high resolution will be decorated on their outer surfaces with water molecules (cyan balls) with relatively well-defined positions, and waters may also occur internally
Water is not just surrounding the
protein--it is interacting with it
Water interacts with protein surfaces
Most waters visible in structures make hydrogen bonds to each other
and/or to the protein, as donor/acceptor/both
• First shell waters: in contact with/hydrogen bonded to protein
• Second shell water: only contacts other waters
Side chain conformation
• Side chains differ in their number of degrees of conformational freedom (some don’t have any, such as Ala and Gly)
• But side chains of very different size can have the same number of χ angles.
Supersecondary structures/structural
motifs
• Just as there are certain secondary structure elements that are common, there are also
particular arrangements of multiple secondary structure elements that are common
• Supersecondary structures emphasize the issue of topology in protein structure
• Examples: β-α-β motif, greek key motif
Topology: differences in connectivity
• Example: a four-stranded antiparallel β sheet can have
many different topologies based on the order in which the
four β strands are connected:
“up-and-down”
“greek key”
Topology: differences in handedness
• Example: An extremely common supersecondary structure in proteins is the
beta-alpha-beta motif, in which two adjacent beta-strands are arranged in
parallel and are separated in the sequence by a helix which packs against them.
• If the two parallel strands are oriented to face toward you, the helix can be
either above or below the plane of the strands.
• Huge preference for right-handed
arrangement in proteins
DIY: The sequence
DIY: The server
DIY: The reply
DIY: fine tuning
DIY: That is it!
The CATH Hierarchy
1. Divide PDB structure entries into domains (using domain recognition
algorithms); the domain is the fundamental unit of structure classification
2. Classify each domain according to a five level hierarchy:
Class
Architecture
Topology
Homologous Superfamily
Sequence Family
• The top 3 levels of the hierarchy are purely phenetic: based on characteristics of the structure, not on evolutionary relationships
• The bottom two levels include some phyletic classification as well: groupings according to putative common ancestry based on structural similarity, functional similarity, and sequence similarity
• There is no purely phyletic system of protein classification! (It is also unlikely that there is any common ancestor to all proteins.)
SCOP: A different (but similar) taxonomy system
Correspondences between SCOP and CATH hierarchies:
SCOP            CATH
class           class
                architecture
fold            topology
superfamily     homologous superfamily
family          sequence family
domain          domain
CATH is more directed toward structural classification, whereas SCOP
pays more attention to evolutionary relationships. Both have manual
aspects and are curated by experts.
Internal interactions in a protein
Amino acids: the building blocks of
proteins
H2N-CH(R)-CO-OH
• amino group (H2N)
• alpha carbon (CH)
• side chain (R)
• carboxylic acid group (CO-OH)
+H3N-CH(R)-COO-
The zwitterionic form is the predominant form at neutral pH
The alpha carbon is a chiral center--natural proteins are made of L amino acids (shown above) as opposed to
D