Chap. 11 Protein Structures
Amino Acid


General structure of amino acids
 an amino group
 a carboxyl group
 α-carbon bonded to a hydrogen
and a side-chain group, R
Side chain R determines the identity
of particular amino acid
(Figure: model of an amino acid. Color key: R: large white and gray; C: black; nitrogen: blue; oxygen: red; hydrogen: white)
Protein



Protein: polymer consisting of AA’s linked by peptide bonds
 AA in a polymer is called a residue
Folded into 3D structures
Structure of protein determines its function
 Primary structure: linear arrangement of AA’s




AA sequence (primary structure) determines 3D structure of a
protein, which in turn determines its properties
N- and C-terminal
Secondary structure: short stretches of AAs
Tertiary structure: overall 3D structure
Protein Structures
Secondary structure



Secondary structures have repetitive interactions resulting from
hydrogen bonding between N-H and carboxyl groups of peptide
backbone
Conformations of side chains of AA are not part of the secondary
structure
α-helix
Secondary structure

β-pleated sheet
 Parallel/antiparallel

3D form of antiparallel
Secondary structure: domain
Part of chain folds independently of foldings of
other parts
• Such independent folded portion of protein is
called domain (super-secondary structure)

(a) βαβ unit
(b) αα unit (helix-turn-helix)
(c) β-meander
(d) Greek key
Domain


Larger proteins are modular
 Their structural units, domains or folds, can be covalently linked to
generate multi-domain proteins
 Domains are not only structurally, but also functionally, discrete
units – domain family members are structurally and functionally
conserved and recombined in complex ways during evolution
 Domains can be seen as the units of evolution
 Novelty in protein function often arises as a result of gain or loss of
domains, or by re-shuffling existing domains along sequence
 For pairs of protein domains with the same 3D fold, precise function is conserved down to ~40% sequence identity (broad functional class is conserved down to ~20%)
DNA binding domains
 http://en.wikipedia.org/wiki/DNA-binding_domain
Motif



Short, conserved regions (frequently the most conserved regions of a domain)
Critical for the domain to function
Domain vs. Motif
 Motifs are structural characteristics
 Domains are functional regions, usually consisting of a few motifs
Motif Representation

Motif
 In multiple alignments of distantly related sequences, highly conserved regions are called motifs, features, signatures or blocks
 They tend to correspond to core structural and functional elements of the proteins
Motif

Greek key motif is often found in β-barrel tertiary structure
(Figure: protein modules) (a) complement control protein module; (b) immunoglobulin module; (c) fibronectin type I module; (d) growth factor module; (e) Kringle module
(Figure: β-barrel arrangements) (a) linked series of β-meanders; (b) Greek key pattern; (c) alternating βα units; (d) top and side views (α-helical section is outside)
Secondary structure: conformation

Two types of Protein Conformations
 Fibrous
 Globular – folds back onto itself to create a spherical shape
(a) Schematic diagrams of fibrous and globular proteins
(b) Computer-generated model of globular protein
Secondary Structure Prediction


Ab initio prediction (from AA sequence)
 Still an open problem
1974 Peter Chou and Gerald Fasman




Use known structures to determine which AA contributes to
each secondary structure
Propensity values: likelihood that an AA appears in a particular structure
 P(a), P(b) and P(turn)
 >1 indicates a greater than average chance (the tabulated values are multiplied by 100, so the cutoffs used by the algorithm are near 100)
Frequency values: frequency of an AA being found in a hairpin
 Four positions in a hairpin beta-turn
Accuracy is around 50-60%, but popular due to its foundation
for later prediction programs
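The propensity idea can be sketched in a few lines of Python. This is a minimal illustration, not the original 1974 procedure: the function name `propensities`, the one-letter state codes (`H` for helix, `C` for coil), and the toy sequence are all assumptions made for the example. A propensity here is the share of an amino acid among the residues in one structural state, divided by its share among all residues.

```python
from collections import Counter

def propensities(sequence, states, target="H"):
    """Return {residue: propensity} for one structural state.

    propensity = (fraction of the state's residues that are this AA)
               / (fraction of all residues that are this AA)
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    total = Counter(sequence)
    in_state = Counter(aa for aa, s in zip(sequence, states) if s == target)
    n_all = len(sequence)
    n_state = sum(in_state.values())
    if n_state == 0:
        return {}
    return {aa: (in_state[aa] / n_state) / (total[aa] / n_all) for aa in total}

# Toy data, made up for illustration (not taken from a real structure).
seq = "AAEELLGGPPAAEE"
states = "HHHHHHCCCCHHHH"
p_helix = propensities(seq, states, "H")
```

In this toy case alanine only occurs in helical positions, so its value is above 1, while glycine and proline never do and get 0. Multiplying by 100 gives numbers on the same scale as the table.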
AA             P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        142    83    66      0.060  0.076   0.035   0.058
Arginine        98    93    95      0.070  0.106   0.099   0.085
Asparagine      67    89    95      0.161  0.083   0.191   0.091
Aspartic acid  101    54   146      0.147  0.110   0.179   0.081
Cysteine        70   119   119      0.149  0.050   0.117   0.128
Glutamic acid  151    37    74      0.056  0.060   0.077   0.064
Glutamine      111   110    98      0.074  0.098   0.037   0.098
Glycine         57    75   156      0.102  0.085   0.190   0.152
Histidine      100    87    95      0.140  0.047   0.093   0.054
Isoleucine     108   160    47      0.043  0.034   0.013   0.054
Leucine        121   130    59      0.061  0.025   0.036   0.070
Lysine         114    74   101      0.055  0.115   0.072   0.095
Methionine     145   105    60      0.068  0.082   0.014   0.055
Phenylalanine  113   138    60      0.059  0.041   0.065   0.065
Proline         57    55   152      0.102  0.301   0.034   0.068
Serine          77    75   143      0.120  0.139   0.125   0.106
Threonine       83   119    96      0.086  0.108   0.065   0.079
Tryptophan     108   137    96      0.077  0.013   0.064   0.167
Tyrosine        69   147   114      0.082  0.065   0.114   0.125
Valine         104   170    50      0.062  0.048   0.028   0.053
Chou-Fasman Algorithm


Step 1: identify alpha-helices
 Find a region of six contiguous residues where at least four
have P(a)>103
 Extend the region until a set of four contiguous residues with
P(a)<100 is found
 If region’s average P(a)>103, length is >5, and ∑P(a)> ∑P(b),
alpha
Step 2: beta strands
 Find a region of five contiguous residues with at least three
with P(b)>105
 Extend the region until a set of four contiguous residues with
P(b)<100 is found
 If region’s average P(b)>105, and ∑P(b)> ∑P(a), beta
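Steps 1 and 2 follow the same nucleate-extend-accept pattern, so a sketch of Step 1 covers both. The code below is one possible reading of the slide, not a reference implementation: the function names are hypothetical, the P(a) and P(b) values are copied from the propensity table in this section, and the choice to stop extending when fewer than four residues remain at a sequence end is an assumption.

```python
# P(a) and P(b) (x100), keyed by one-letter code, copied from the table in this section.
P_ALPHA = {"A": 142, "R": 98, "N": 67, "D": 101, "C": 70, "E": 151, "Q": 111,
           "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
           "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 104}
P_BETA = {"A": 83, "R": 93, "N": 89, "D": 54, "C": 119, "E": 37, "Q": 110,
          "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
          "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}

def helix_nuclei(seq, window=6, min_hits=4, threshold=103):
    """Start indices of 6-residue windows with at least 4 residues where P(a) > 103."""
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(P_ALPHA[aa] > threshold for aa in seq[i:i + window])
        if hits >= min_hits:
            starts.append(i)
    return starts

def extend_helix(seq, start, window=6, run=4, cutoff=100):
    """Grow the half-open region [lo, hi) until four contiguous residues
    with P(a) < 100 lie just outside it (or the sequence end is too close)."""
    def is_breaker(chunk):
        return all(P_ALPHA[aa] < cutoff for aa in chunk)
    lo, hi = start, start + window
    while hi + run <= len(seq) and not is_breaker(seq[hi:hi + run]):
        hi += 1
    while lo >= run and not is_breaker(seq[lo - run:lo]):
        lo -= 1
    return lo, hi

def is_helix(seq, lo, hi, min_avg=103, min_len=5):
    """Accept if average P(a) > 103, length > 5, and sum P(a) > sum P(b)."""
    region = seq[lo:hi]
    if len(region) <= min_len:
        return False
    sum_a = sum(P_ALPHA[aa] for aa in region)
    sum_b = sum(P_BETA[aa] for aa in region)
    return sum_a / len(region) > min_avg and sum_a > sum_b

seq = "GPSGAELMKAGPSG"            # made-up peptide for illustration
nuclei = helix_nuclei(seq)
lo, hi = extend_helix(seq, nuclei[0])
helical = is_helix(seq, lo, hi)
```

For beta strands the same functions would be called with a window of 5, at least 3 hits, a threshold of 105, and the roles of the two tables swapped.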
Chou-Fasman Algorithm


Step 3: beta turns
 For each residue position j, compute the turn propensity P(t)_j = f(i)_j × f(i+1)_{j+1} × f(i+2)_{j+2} × f(i+3)_{j+3}
 A turn is predicted at position j if P(t)_j > 0.000075, the average P(turn) from j to j+3 is > 100, and ∑P(a) < ∑P(turn) > ∑P(b)
Step 4: overlaps
 If alpha region overlaps with beta, the region’s ∑P(a) and
∑P(b) determine the most likely structure in the overlapped
region
 If ∑P(a) > ∑P(b) for the overlapping region, alpha
 If ∑P(a) < ∑P(b) for the overlapping region, beta
 If ∑P(a) = ∑P(b), no valid call
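Steps 3 and 4 reduce to a product of four positional frequencies and a comparison of sums. The sketch below assumes the standard Chou-Fasman column assignment for f(i) to f(i+3) and lists values for only six residues, which is enough for the demo; the function names and the two test peptides are hypothetical.

```python
import math

# residue: (f(i), f(i+1), f(i+2), f(i+3)); a subset of the table, for the demo only.
F_TURN = {
    "D": (0.147, 0.110, 0.179, 0.081),
    "P": (0.102, 0.301, 0.034, 0.068),
    "G": (0.102, 0.085, 0.190, 0.152),
    "W": (0.077, 0.013, 0.064, 0.167),
    "A": (0.060, 0.076, 0.035, 0.058),
    "L": (0.061, 0.025, 0.036, 0.070),
}
P_TURN = {"D": 146, "P": 152, "G": 156, "W": 96, "A": 66, "L": 59}
P_ALPHA = {"D": 101, "P": 57, "G": 57, "W": 108, "A": 142, "L": 121}
P_BETA = {"D": 54, "P": 55, "G": 75, "W": 137, "A": 83, "L": 130}

def turn_propensity(seq, j):
    """P(t)_j: product of f(i+k) taken at residue j+k, for k = 0..3."""
    return math.prod(F_TURN[seq[j + k]][k] for k in range(4))

def is_turn(seq, j, cutoff=0.000075):
    """Apply the three Step 3 conditions to the tetrapeptide starting at j."""
    window = seq[j:j + 4]
    if len(window) < 4:
        return False
    sum_t = sum(P_TURN[aa] for aa in window)
    sum_a = sum(P_ALPHA[aa] for aa in window)
    sum_b = sum(P_BETA[aa] for aa in window)
    return (turn_propensity(seq, j) > cutoff
            and sum_t / 4 > 100
            and sum_t > sum_a and sum_t > sum_b)

def resolve_overlap(sum_a, sum_b):
    """Step 4: pick the state with the larger sum; None means no valid call."""
    if sum_a > sum_b:
        return "alpha"
    if sum_a < sum_b:
        return "beta"
    return None

turn_like = is_turn("DPGW", 0)    # made-up tetrapeptide rich in turn formers
helix_like = is_turn("ALLA", 0)   # made-up tetrapeptide rich in helix formers
```

Proline at position i+1 and glycine at i+2 dominate the product in the first peptide, which is why it clears the 0.000075 cutoff while the second does not.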
Secondary structure prediction




Chou and Fasman (1974): based on the frequencies of amino acids found in α-helices, β-sheets, and turns.
Proline: occurs at turns, but not in α-helices.
GOR (Garnier, Osguthorpe, Robson): related algorithm
Modern algorithms: use multiple sequence alignments and achieve
higher success rate (about 70-75%)
Page 427
Secondary structure prediction
Web servers:
GOR4
Jpred
NNPREDICT
PHD
Predator
PredictProtein
PSIPRED
SAM-T99sec
Table 11-3
Page 429
Secondary Structure Prediction by PSIPRED





Prediction of regions of the protein that form alpha-helix,
beta-sheet, or random coil
http://bioinf.cs.ucl.ac.uk/psipred/
Based on neural networks
Uses Chou-Fasman-like algorithm but first does PSI-BLAST
search to get a collection of sequences related to the input
(searching for orthologous sequences)
Univ. College London, 1999
PSI-BLAST is performed in five steps
1. Select a query and search it against a protein database
2. PSI-BLAST constructs a multiple sequence alignment then
creates a “profile” or specialized position-specific
scoring matrix (PSSM)
Page 146
Inspect the blastp output to identify empirical “rules”
regarding amino acids tolerated at each position
(Figure: blastp output annotated with the residues tolerated at individual positions: R,I,K; C; D,E,T; K,R,T; N,L,Y,G)
Position-specific scoring matrix (PSSM)
Columns: the 20 amino acids
Rows: all the amino acids from position 1 to the end of your PSI-BLAST query protein

Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1 M   -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
  2 K   -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
  3 W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  4 V    0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
  5 W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  6 A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
  7 L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
  8 L   -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
  9 L   -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
 10 L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 11 A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 12 A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 13 W   -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
 14 A    3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
 15 A    2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
 16 A    4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
 37 S    2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
 38 G    0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
 39 T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
 40 W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 41 Y   -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
 42 A    4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
(Figure: the same PSSM, with the alanine column highlighted)
Note that a given amino acid (such as alanine) in your query protein can receive different scores for matching alanine, depending on the position in the protein
 Alanine scores 5 at positions 6, 11 and 12, but 3, 2 and 4 at positions 14, 15 and 16
(Figure: the same PSSM, with the tryptophan column highlighted)
Note that a given amino acid (such as tryptophan) in your query protein can receive different scores for matching tryptophan, depending on the position in the protein
 Tryptophan scores 12 at positions 3, 5 and 40, but 7 at position 13
PSI-BLAST is performed in five steps
1. Select a query and search it against a protein database
2. PSI-BLAST constructs a multiple sequence alignment then creates a
“profile” or specialized position-specific scoring matrix (PSSM)
3. The PSSM is used as a query against the database
4. PSI-BLAST estimates statistical significance (E values)
5. Repeat steps 3 and 4 iteratively, typically 5 times.
At each new search, a new profile is used as the query
Page 146
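Step 2 can be illustrated with a deliberately simplified profile builder. This is a sketch under stated assumptions, not what PSI-BLAST actually does: it takes an ungapped toy alignment, uses a flat pseudocount and a uniform background frequency of 0.05, and skips sequence weighting entirely. The names `build_pssm` and `score` are hypothetical.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_pssm(alignment, pseudocount=1.0, background=0.05):
    """One dict per alignment column: log2 of (smoothed frequency / background)."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        column = [s[col] for s in alignment]
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {aa: math.log2(((column.count(aa) + pseudocount) / denom) / background)
               for aa in AMINO_ACIDS}
        pssm.append(row)
    return pssm

def score(pssm, seq):
    """Sum the position-specific scores of seq against the profile."""
    return sum(row[aa] for row, aa in zip(pssm, seq))

# Made-up four-column alignment for illustration.
alignment = ["MKWV", "MKWI", "MRWV", "MKWL"]
pssm = build_pssm(alignment)
w_conserved = pssm[2]["W"]   # column where every sequence has W
w_elsewhere = pssm[3]["W"]   # column where W never occurs
```

The same residue gets a high score in the column where it is conserved and a negative score where it is absent, which is the position dependence a PSSM adds over a single substitution matrix.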
SRC protein




Tyrosine kinase
Enzyme putting a phosphate group on tyrosine AA (phosphorylation)
Activates an inactive protein, eventually activates cell-division proteins
NP_005408
>gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase Src [Homo sapiens]
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSS
DTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV
APSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL
DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC
FGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLL
DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT
ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC
PESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
Examining Crystal Structure




Cn3D: NCBI structure viewer and modeling tool
DeepView: SWISSPROT
Jmol
NCBI Structure database
 Links to NCBI MMDB (Molecular Modeling Database)
 MMDB contains experimentally verified protein structures

SRC – MMDB ID 56157, PDB ID 1FMK

View Structure from NCBI Structure database
 Opens up Cn3D window
 Click to rotate; Ctrl-click to zoom; Shift-click to move
 Rendering and coloring menus
Tertiary structure



3D arrangement of all atoms in the molecule
Considers arrangement of helical and sheet sections,
conformations of side chains, arrangement of atoms of side
chains, etc.
Experimentally determined by
 X-ray crystallography –
measure diffraction patterns of
atoms

NMR (Nuclear Magnetic
Resonance) spectroscopy –
use protein samples in
aqueous solution
• Tertiary structure of α-lactalbumin
myoglobin
Protein families

Groups of genes of identical or similar sequence are common
 Sometimes, repetition of identical sequences is correlated with the
synthesis of increased quantities of a gene product






e.g., a genome contains multiple copies of ribosomal RNA genes
Human chromosome 1 has 2000 genes for 5S rRNA (S: sedimentation coefficient), and chr 13, 14, 15, 21 and 22 have 280 copies of a repeat unit made up of 28S, 5.8S and 18S
Amplification of rRNA genes evolved because of heavy demand for rRNA synthesis during cell division
These rRNA genes are examples of gene families having identical or near identical sequences
Sequence similarities indicate a common evolutionary origin
α- and β-globin families have distinct sequence similarities and evolved from a single ancestral globin gene
Protein families and superfamilies

Dayhoff classification, 1978
 Protein families – at least 50% AA sequence similarity (based on physico-chemical AA features)
 Related proteins with less similarity (35%) belong to a superfamily and may have quite diverse functions
 α- and β-globins are classified as two separate families, and together with myoglobins form the globin superfamily
 These families have distinct sequence similarities and evolved from a single ancestral globin gene
Protein family database

Pattern or secondary database derived from sequences





a pattern may be the most conserved aspects of sequence families
The most conserved part may vary between species
Use scoring system to account for some variability
Position-specific scoring matrix (PSSM) or Profile
 In contrast, a pairwise alignment applies the same weights regardless of position
Protein family databases are derived by different analytical
techniques


But all of them try to find motifs, conserved regions considered to reflect shared structural or functional characteristics
Three groups: single motifs, multiple motifs, or full domain alignments
Protein family databases

Pattern or secondary database derived from sequences
Database   Data source             Stored info
PROSITE    Swiss-Prot              Regular expressions (patterns) of single most conserved motif
Profiles   Swiss-Prot              Weighted matrices (profiles) of position-sensitive weights
PRINTS     Swiss-Prot and TrEMBL   Aligned motifs (fingerprints)
Pfam       Swiss-Prot and TrEMBL   Multiple sequence alignment of a protein domain or conserved region
Blocks     InterPro/PRINTS         Aligned motifs (blocks)
eMOTIF     Blocks/PRINTS           Permissive regular expressions
Single Motif Method

Regular expression
 PROSITE
 PDB 1ivy



Carboxypet_Ser_His (PS00560)
[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]
 [] – any of the enclosed symbols
 x – any residue
 (3) – number of repeats
Fuzzy regular expression
 Build regular expressions with info on shared biochemical
properties of AA
 Provide flexibility according to AA group clustering
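A PROSITE-style pattern maps almost directly onto an ordinary regular expression. The converter below is a sketch that handles only the elements explained on this slide plus `{}` exclusions and `x(m,n)` ranges; anchors such as `<` and `>` are not handled. The function name `prosite_to_regex` and the test peptide are made up, and the pattern string is the cleaned-up reading of the one shown for PS00560.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # x: any residue
            core = "."
        elif core.startswith("{"):           # {ABC}: any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # (n) or (m,n): repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

pattern = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-"
           "[IVAQ]-P-x(3)-[PSA]")
regex = prosite_to_regex(pattern)
hit = re.search(regex, "MMLAALAIGSGHAIPAAAPMM")    # made-up peptide containing the motif
miss = re.search(regex, "MMLAALAIGSGKAIPAAAPMM")   # same peptide with H replaced by K
```

Because every position is either fixed or a small character class, a single substitution at the invariant histidine is enough to lose the match. That rigidity is what fuzzy regular expressions relax.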
Multiple motif methods

PRINTS


Encode multiple motifs (called fingerprints) in ungapped,
unweighted local alignments
BLOCKS







Derived from PROSITE and PRINTS
Use the most highly conserved regions in protein families in
PROSITE
Use motif-finding algorithm to generate a large number of
candidate blocks
Initially, three conserved AA positions anywhere in the alignment
are identified and used as anchors
Blocks are iteratively extended and ultimately encoded as
ungapped local alignments
Graph theory is used to assemble a best set of blocks for a given
family
Use position specific scoring matrix (PSSM), similar to a profile
Full domain alignment


Profiles
 Use family-based scoring matrix via dynamic programming
 Has position-specific info on insertions and deletions in the
sequence family
Hidden Markov Model (HMM)
 PFAM, SMART, TIGRFAM represent full domain alignments
as HMMs
 PFAM



Represents each family as seed alignment, full alignment,
and an HMM
Seed contains representative members of the family
Full alignment contains all members of the family as detected with HMM constructed from seed alignment
Structure-based Sequence Alignment


It is well known that alignments based on sequence similarity alone are not always correct, and that proteins can have similar structures with no detectable sequence similarity
Sequence alignment is augmented by structural alignments
 COMPASS, HOMSTRAD, PALI, ..
Protein Structure
Comparison/Classification
Protein structures



Domain
 Polypeptide chain in a protein folds into a ‘tertiary’
structure
 One or more compact globular regions called domains
 The tertiary structure associated with a domain region is
also described as a protein fold
Multi-domain
 Proteins whose polypeptide chains fold into several domains
 Nearly half the known globular structures are multidomain, and more than half of these have two domains
Automatic structure comparison methods were introduced in the 1970s, shortly after the first crystal structures were stored in PDB
Structure comparison algorithms


Two main components in structure comparison algorithms
 Scoring similarities in structural features
 Optimization strategy maximizing similarities measured
Most are based on geometric properties from 3D coordinates
 Intermolecular method
 Superpose structures by minimizing distance between superposed positions
 Intramolecular method
 Compare sets of internal distances between positions to identify an alignment maximizing the number of equivalent positions
 Distance is described by RMSD (Root Mean Square Deviation), the square root of the average squared distance between equivalent atoms
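The RMSD definition translates directly into code. The sketch below assumes the two structures are already superposed and that the atoms are listed in equivalent order; finding the optimal superposition is a separate step that is not shown. The function name `rmsd` and the coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Square root of the mean squared distance between equivalent atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

# Two toy "structures": the second is the first shifted by 1 along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(structure_a, structure_b)
```

Every atom is displaced by the same amount here, so the RMSD equals that displacement. A proper superposition would remove this rigid shift before measuring.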
Inter vs. Intra
RMSD
Distant homolog


Structure is more
conserved than
sequences during
evolution
Structural similarity
between distant
homologs can be found
 Pairwise sequence
similarity
 SSAP structural
similarity score in
parenthesis (0 –
100)
Distant homolog
Structural variations in protein families
Structure comparison algorithms



SSAP, 1989
 Residue level, Intra, Dynamic programming
DALI, 1993
 Residue fragment level, intra, Monte Carlo
optimization
COMPARER, 1990
 Multiple element level, both, Dynamic programming
Structure classification hierarchy





Class level -- proteins are grouped according to their structural class (composition of residues in α-helical and β-strand conformations)
 Mainly-α, mainly-β, alternating α-β, α plus β (mainly-α and mainly-β regions are segregated)
Architecture
 the manner by which secondary structure elements
are packed together (arrangement of sec. structures
in 3D space)
Fold group (topology)
 Orientation of sec. structures and the connectivity
between them
Superfamily
Family
Hierarchy example
Protein Structure databases

PDB


Over 20,000 entries deduced from X-ray diffraction, NMR or modeling
Massively redundant
 1FMK, 1BK5, 2F9C, ..
Protein Structure databases

SCOP (Structural Classification of Proteins)





Multi-domain protein is split into its constituent domains
Known structures are classified according to evolutionary and
structural relationship
Domains in SCOP are grouped by species and hierarchically classified into families, superfamilies, folds and classes
 Family level – group together domains with clear sequence similarities
 Superfamily – group of domains with structural and functional evidence for their descent from a common evolutionary ancestor
 Fold – group of domains with the same major secondary structure with the same chain topology
Domains identified manually by visually inspecting structures
Proteins in the same superfamily often have the same function
Protein Structure databases

CATH (Class, Architecture, Topology, Homology)



Homology – clustered domains with 35% sequence identity and
shared common ancestry
800 fold families, 10 of which are super-folds
2009 www.cs.uml.edu/~kim/580/08_cath.pdf
Structure classification

Most structure classifications are established at the
domain level


Thought to be an important evolutionary unit and easier to
determine domain boundaries from structural data than
from sequence data
Criteria for assessing domain regions within a structure




The domain possesses a compact globular structure
Residues within a domain make more internal contacts
than to residues in the rest of polypeptide
Secondary structure elements are usually not shared with
other regions of the polypeptide
There is evidence for existence of this region as an
evolutionary unit
CATH classifications
Multi-domain structures
Protein Function/Structure
Prediction
Protein Function Prediction



In the absence of experimental data, function of a protein is usually inferred from its sequence similarity to a protein of known function
 The more similar the sequence, the more similar the function is likely to be
 Not always true
Can clues to function be derived directly from 3D structure?
Definition of function
 Function can be described at many levels: biochemical, biological processes, pathways, organ level
 Proteins are annotated at different degrees of functional specificity: ubiquitin-like domain, signaling protein, ..
 GO (Gene Ontology) scheme
Protein Function Prediction


Sequence-based – largely unreliable
Profile-based
Profiles are constructed from sequences of whole protein families, where families are grouped by 3D structure or function (as in Pfam)
 Start with sequences matched by an initial search, iteratively pull in
more remote homologues
 More sensitivity than simple sequence comparison because profiles
implicitly contain information on which residues within the family are
well conserved and which sites are more variable
Structure-based
 Fold-based
 Proteins sharing similar functions often have similar folds, resulting from descent from a common ancestral protein
 Sometimes, function of proteins alters during evolution with the folds unchanged
 Thus, fold match is not always reliable
 Surface clefts and binding pockets


Chap. 12 RNA Structures
RNA structure

Stem-loop structure
RNA structure

A loop structure




A loop between i and j when base at i pairs with base at j
Base at i+1 pairs with base at j
Or base at i pairs with base at j-1
Or a multiple loop
RNA secondary structure

Search for minimum free energy


Gibbs free energy at 37 degrees (C)
Free energy increments of base pairs
are counted as stacks of adjacent
pairs
 Successive CGs: -3.3 kcal/mol
 Unfavorable loop initiation
energy to constrain bases in a
loop
RNA structure prediction

Ad-hoc approach




Simply look at a strand and find areas where base pairing can
occur
Possible to find many locations where folds can occur
Prediction should be able to determine the most likely one
 What should be the criteria ?
1980, Nussinov-Jacobson Algorithm



More stable one is the most likely structure
Find the fold that forms the greatest number of base pairs
(base-pairing lowers the overall energy of the strand, more
stable)
Checking for all possible folds is impossible -> dynamic
programming
Nussinov-Jacobson Algorithm



Create an n×n matrix for a sequence with n bases
Initialize the diagonal to 0
Fill the matrix with the largest number of base pairs (S)
S(i,j) = \max \{ S(i+1, j-1) + w(i,j),\; S(i+1, j),\; S(i, j-1),\; \max_{i<k<j} [ S(i,k) + S(k+1,j) ] \}
w(i,j) = 1 if base i can be paired with base j, and 0 otherwise
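The recurrence can be filled in by increasing subsequence length. The sketch below follows the four cases as written on the slide and returns only the maximum number of pairs, with no traceback. Two assumptions are mine: only Watson-Crick pairs (A-U, G-C) count, and no minimum loop length is enforced. The names `can_pair` and `nussinov` are hypothetical.

```python
def can_pair(a, b):
    """w(i,j): 1 for Watson-Crick pairs (an assumption of this sketch), else 0."""
    return 1 if {a, b} in ({"A", "U"}, {"G", "C"}) else 0

def nussinov(seq):
    """Maximum number of base pairs in seq under the Nussinov recurrence."""
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]          # diagonal stays 0
    for span in range(1, n):                 # fill by increasing j - i
        for i in range(n - span):
            j = i + span
            inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = max(inner + can_pair(seq[i], seq[j]),   # i pairs with j
                       S[i + 1][j],                        # i left unpaired
                       S[i][j - 1])                        # j left unpaired
            for k in range(i + 1, j):                      # bifurcation
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]

hairpin_pairs = nussinov("GGGAAACCC")   # made-up stem-loop sequence
```

The table has n² cells and each cell scans up to n split points, so the fill takes O(n³) time, which is what makes it practical compared with enumerating every possible fold.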