Local Statistical Dependencies
in Protein Structure: Discovery,
Evaluation, Prediction and
Applications
Advancement to Candidacy
Computer Science Department
by Rachel Karchin
Advisor: Kevin Karplus
2
Outline
Protein structure - primary,
secondary, tertiary
Fold recognition, local and
secondary structure
Alphabets of local structure
Designing and evaluating local
structure alphabets
Improving fold recognition
3
Molecular structure of proteins
Proteins are large, organic molecules
composed of smaller molecules called
amino acids.
threonine cysteine arginine
Ball-and-stick atomic model of Crambin
plant seed protein with 44 amino acids
4
The amino acids
There are 20 kinds of
amino acids found in
natural proteins.
All share a common
structure.
R
amine group
side chain
carboxyl group
alpha carbon
(with attached hydrogen)
Biochemistry, Mathews, 3rd ed., Addison-Wesley
5
Primary structure
Proteins consist of one or more
polypeptide chains of amino acids
connected by peptide bonds.
The sequence of
linked amino acids
along the chain is
called the protein’s
primary structure.
Phe-Leu-Ser-Cys . . .
FLSC . . .
Access Excellence NHGRI Graphics Gallery
6
Secondary structure
Symmetric patterns of hydrogen bonds
between amino acids.
Helix. H-bonds between residues close in
primary sequence.
Anthony Day/Pace et al. 1996
7
Secondary structure
Strand. H-bonds between residues not
close in primary sequence.
Anthony Day/Pace et al. 1996
8
Protein Folding
In an aqueous environment (such as cell
cytoplasm), polypeptide chains fold into 3D
shapes (tertiary structure).
9
From primary to tertiary structure
A protein’s 3D shape is determined by its
primary amino acid sequence.
Anfinsen et al. 1963.
Predicting tertiary structure from amino
acid sequence is an unsolved problem.
– Difficult to model the
energies that stabilize a
protein molecule.
– Conformational search
space is enormous.
Laboratory of Molecular
Biophysics, University of Oxford
10
Fold recognition
In nature, proteins are observed to assume on
the order of a thousand shapes or “folds”.
Biochemistry, Mathews, 3rd ed., Addison-Wesley
11
Fold recognition
Given an amino acid sequence target:
– search a set of known folds by aligning target and a
template fold representative
– predict the fold that gets the best scoring alignment
Target amino acid sequence
YLAADTYK
Template
Fold library
Template amino acid sequence
Target/template Score:
FISSETCN
MEPSSYV
7
TGLIRKN
21
2
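The search described on this slide can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it scores each template with a plain Needleman-Wunsch global alignment, and the match, mismatch and gap values as well as the fold names are assumptions chosen for the example. The sequences are the ones shown on the slide.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (toy scoring scheme, assumed)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def recognize_fold(target, library):
    """Return (fold_name, score) for the best-scoring template in the library."""
    scores = {name: global_alignment_score(target, seq)
              for name, seq in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical fold names; template sequences as shown on the slide.
library = {"fold_1": "FISSETCN", "fold_2": "MEPSSYV", "fold_3": "TGLIRKN"}
print(recognize_fold("YLAADTYK", library))
```

A realistic scorer would use a substitution matrix and affine gap penalties rather than identity scoring.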
12
Twilight zone sequence
relationships
This method is very effective when target
and template have > 30% sequence
identity.
Approximately 1/3 of protein sequences
can be assigned folds and modeled this
way.
We would like to extend the method to
sequences in the twilight zone
(< 30% identity to any sequence of known
structure).
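The 30% threshold can be made concrete with a small helper. This is a sketch under one assumed definition of sequence identity (identical residues divided by aligned, non-gap columns); other denominators are also in use, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True when the pair falls below the identity threshold (30% by default)."""
    return percent_identity(aligned_a, aligned_b) < threshold


print(percent_identity("YLAADTYK", "YLASDS-R"))
print(in_twilight_zone("YLAADTYK", "FISTE-HR"))
```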
13
SAM-T98
Build a target HMM of amino acid
frequencies from a multiple alignment of
target plus homologs (SAM-T98).
Target amino acid sequence
YLAADTYK
Multiple
alignment
Courtesy of K. Karplus
Target amino acid HMM
Search
for
homologs
Protein Database
YLAADTYK
FISTE-HR
HVATD-H-ITA--HR
YLASDS-R
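The core of "amino acid frequencies from a multiple alignment" can be sketched as per-column counts. This is only the frequency step: a profile HMM such as the one SAM builds also models insertions and deletions and regularizes the counts, none of which is shown here. The function name is hypothetical, and only the three cleanly aligned rows from the slide are used.

```python
from collections import Counter


def column_frequencies(alignment):
    """Per-column residue frequencies, ignoring gap characters."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


alignment = ["YLAADTYK", "FISTE-HR", "YLASDS-R"]
profile = column_frequencies(alignment)
print(profile[0])  # frequencies observed in the first column
```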
14
SAM-T98
Amino acid HMM for target. Amino acid
strings for templates
Three-fold increase in recognizing twilight
zone similarities (Park et al. 1998)
Target amino acid HMM
Template Fold library
Courtesy of K. Karplus
Template amino acid sequence
Target/template Score:
FISSETCN
MEPSSYV
7
TGLIRKN
21
2
15
SAM-T98 enhancements
Two-way scoring
Augment the method with secondary
structure information.
16
Two-way SAM-T98
Also build amino acid HMMs for templates.
Do 2-way scoring to strengthen recognition
of twilight zone relationships.
Target amino acid sequence
Template Fold library
YLAADTYK
Template amino acid HMMs
Target/template Score:
19
82
31
17
Secondary structure
DSSP alphabet (Kabsch and Sander 1983).
Classifies the secondary structure of a residue
using known tertiary structure.
Basic patterns:
turn T
bend S
bridge B
Repeating turns:
3-10 helix G
alpha helix H
pi helix I
Repeating bridges:
beta strand E
Other:
random coil C
Biochemistry, Mathews, 3rd ed., Addison-Wesley
18
Secondary structure
Alternatives to DSSP definitions.
– Collapse 8 classes to 3: H,E,C
– Other programs to automate
assignment:
• Richards and Kundrot (1988) Define
• Sklenar (1989) P-Curve
• Adzhubei and Sternberg (1993)
• Frishman and Argos (1995) STRIDE
• King and Johnson (1999) xlsstr
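Collapsing the 8 DSSP classes to 3 is a simple table lookup. The slide does not say which reduction is meant, and several conventions exist; the mapping below (G and I to helix, B to strand, T and S to coil) is one common choice and is an assumption here.

```python
# One common 8-to-3 reduction (assumed; other conventions map G, I or B to coil).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and isolated bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def reduce_dssp(ss_string):
    """Translate an 8-state DSSP string into the 3-state H/E/C alphabet."""
    return "".join(EIGHT_TO_THREE[s] for s in ss_string)


print(reduce_dssp("CCGGGHHHHTTEEEEBSC"))
```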
19
Predicting secondary structure
Extensive research on predicting
secondary structure from primary
sequence.
Neural nets are the most successful
approach.
– PHD (Rost and Sander 1996)
– Predict_2nd (Karplus and Barrett 1998)
Best methods are around 75-80% accurate.
20
Secondary structure and fold
recognition
Predicted secondary structure has been shown to be
useful for fold recognition (Russell et al. 1998).
Fold recognition accuracy correlated with
secondary structure prediction accuracy
(Di Francesco 1995, 1997, 1999).
Why?
– Structure more conserved than sequence.
– Proteins in the same fold family have similar
topologies (secondary structure elements have
similar lengths, spatial organization and
connectivities).
21
Two-track SAM-T2K
Predicted probability vectors of secondary
structure added to target HMM
Target amino acid sequence
YLAADTYK
Multiple alignment
YLAADTYK
FISTE-HR
HVATD-H-ITA--HR
Target two-track HMM
       Y     L     A     A     D     T     Y     K
P(H)   0.65  0.15  0.01  0.47  0.85  0.32  0.81  0.5
P(E)   0.2   0.7   0.04  0.45  0.1   0.18  0.09  0.25
P(C)   0.15  0.25  0.9   0.08  0.05  0.5   0.1   0.15
Courtesy of C. Barrett
Courtesy of K. Karplus
22
Two-track SAM-T2K
Search template library of sequence pairs
with two-track target HMM
Target two-track HMM
Template Fold library
Courtesy of K. Karplus
Template with 2 sequence pairs
Target/template Score:
FISSETCN
CCEECHHH
22
MEPSSYV
HHHHCCE
68
TGLIRKN
EEECEEE
15
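To show what the secondary structure track contributes, here is a deliberately simplified sketch. It is not how SAM-T2K scores a template: it ignores the amino acid track and all gaps, requires the template string to have the same length as the target, and assumes a uniform background of 1/3 per state. The probabilities are the P(H), P(E), P(C) values shown for target YLAADTYK on the earlier two-track slide; the function name is hypothetical.

```python
import math

# Predicted probabilities per target position (Y L A A D T Y K), from the slide.
PREDICTED = {
    "H": [0.65, 0.15, 0.01, 0.47, 0.85, 0.32, 0.81, 0.5],
    "E": [0.2, 0.7, 0.04, 0.45, 0.1, 0.18, 0.09, 0.25],
    "C": [0.15, 0.25, 0.9, 0.08, 0.05, 0.5, 0.1, 0.15],
}


def ss_track_score(template_ss, predicted=PREDICTED, background=1.0 / 3.0):
    """Ungapped log-odds (bits) of a template H/E/C string under the predictions."""
    length = len(next(iter(predicted.values())))
    if len(template_ss) != length:
        raise ValueError("this sketch only handles equal-length, ungapped strings")
    return sum(math.log2(max(predicted[state][i], 1e-6) / background)
               for i, state in enumerate(template_ss))


print(ss_track_score("CCEECHHH"))  # template string from the slide
```

A template whose known secondary structure agrees with the high-probability predictions gets a higher score than one that contradicts them, which is the intuition behind adding the second track.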
23
Motivation for alternatives to
secondary structure classes
What’s wrong with secondary structure
classes?
– The most widely used secondary structure
alphabet (3-state DSSP) is crude (Helix,
Strand, Coil).
– Secondary structure classes are
ambiguous.
• Automated assignment methods disagree.
• 63% agreement between DSSP, Define and
P-Curve (Colloc'h et al. 1993).
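An agreement figure of this kind can be computed as the fraction of residues on which all methods assign the same class. The sketch below assumes that definition; the three assignment strings are made up for illustration and are not output from DSSP, Define or P-Curve.

```python
def agreement(*assignments):
    """Fraction of positions where every assignment string has the same letter."""
    length = len(assignments[0])
    if length == 0 or any(len(a) != length for a in assignments):
        raise ValueError("assignments must be non-empty and of equal length")
    same = sum(len(set(column)) == 1 for column in zip(*assignments))
    return same / length


# Hypothetical 3-state assignments of one six-residue stretch by three methods.
print(agreement("HHHEEC", "HHCEEC", "HHHEEC"))
```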
24
Local structure and fold
recognition
What is Local structure?
– describes environment of a residue
– a residue’s relationship to neighbors
Can use this information to predict fold
from primary structure.
Requires comparing local structure of
target and template.
Template local structure: known
Target local structure: must predict (easier than 3D)
25
Low level descriptions of local
structure
Lowest level representation of protein
structure - atomic position vectors.
        Atom         Residue       Position vector
        No.  Type    Type  No.     X        Y        Z
ATOM     1   CA      THR    1      7.047   14.099    3.625
ATOM     2   C       THR    1     16.967   12.784    4.338
ATOM     3   O       THR    1     15.685   12.755    5.133
ATOM     4   N       SER    2     15.115   11.555    5.265
ATOM     5   CA      SER    2     13.856   11.469    6.066
ATOM     6   C       SER    2     14.164   10.785    7.379
ATOM     7   O       SER    2     14.993    9.862    7.443
ATOM     8   CB      SER    2     12.732   10.711    5.261
ATOM     9   N       CYS    3     13.488   11.241    8.417
ATOM    10   CA      CYS    3     13.660   10.707    9.787
Conformations of Biopolymers
IUPAC-IUB
26
Low level descriptions of local
structure
“One level up”. From atomic position
vectors can derive a list of properties that
describe a residue’s local environment.
Conformations of Biopolymers
IUPAC-IUB
27
Dihedral and bond angles
Dihedral angles are
defined by 4 atoms.
Conformations of Biopolymers
IUPAC-IUB
Bond angles are
defined by 3 atoms.
Conformations of Biopolymers
IUPAC-IUB
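Both angle definitions can be computed directly from atomic position vectors. The following is a minimal sketch in plain Python; the helper names are hypothetical, and the dihedral uses the standard atan2 formulation, which gives a positive angle for a clockwise rotation of the far bond when looking along the central bond.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_angle(p1, p2, p3):
    """Angle in degrees at p2, defined by the three atoms p1-p2-p3."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral_angle(p1, p2, p3, p4):
    """Signed dihedral in degrees about the p2-p3 bond, defined by four atoms."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The same two functions cover the virtual angles too: pass alpha-carbon positions of the chosen residues instead of bonded atoms.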
28
Dihedral angles: Phi, Psi, Omega
The 6 atoms in each peptide unit
lie in the same plane.
ω = 180° (trans)
or 0° (cis)
φ and ψ
free to rotate
Biochemistry, Mathews, 3rd ed., Addison-Wesley
29
Dihedral angles: Phi, Psi, Omega
Result: a good approximation of the polypeptide
backbone is a list of (φ, ψ) pairs
(ω cis is rare).
(φ, ψ) pairs are often
represented on a
plane called the
Ramachandran plot.
http://www.biochem.artizona.edu
Biochemistry 462A Lecture Notes
30
A small gallery of properties:
the geometry of local structure
Kappa. Virtual bond angle between
Cα of residues i-2, i, i+2
Zeta. Dihedral angle between
carbonyl bonds of residues i and i-1
Alpha. Virtual dihedral angle between
Cα of residues i-1, i, i+1, i+2
Tau. Virtual bond angle between
Cα of residues i-1, i, i+1
31
Relationship of a residue to its
neighbors
Density measures. How many residues
are within a given distance?
12 neighboring residues
within 6 Å radius
Count of H-bond partners.
2 H-bond partners
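A density measure of this kind reduces to counting points inside a sphere. The sketch below assumes each residue is represented by a single point (for example its alpha carbon, which the slide does not specify) and uses the 6 Å radius from the example; the function name is hypothetical.

```python
import math


def neighbor_counts(coords, radius=6.0):
    """For each residue, count the other residues within `radius` angstroms."""
    counts = []
    for i, a in enumerate(coords):
        n = sum(1 for j, b in enumerate(coords)
                if j != i and math.dist(a, b) <= radius)
        counts.append(n)
    return counts


# Three hypothetical residue positions on a line, in angstroms.
print(neighbor_counts([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]))
```

This brute-force loop is quadratic in the number of residues, which is fine for a single protein; a spatial grid would be the usual choice for large datasets.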
32
Existing local structure alphabets
Approximately 30 alphabets of local
structure in the literature.
Can they be used to improve fold
recognition?
33
Phi/psi alphabets
Classes based on partition of phi/psi space
Bystroff et al. 2000.
10 classes:
BEbdeGHLIx
Sun et al. 1996
DSSP H, E plus
5 phi/psi classes: a b e l t
Bystroff et al. 2000
Kang et al. 1993.
1296 classes: uniform partitioning by 10°
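Uniform partitioning by 10° gives 36 bins per angle and 36 x 36 = 1296 classes. A minimal sketch of such a partition follows; the class numbering (row-major, starting at -180°) is an assumption for illustration and is not taken from Kang et al.

```python
def phi_psi_class(phi, psi, bin_width=10):
    """Map a (phi, psi) pair in degrees to a class index on a uniform grid."""
    bins_per_axis = 360 // bin_width

    def bin_index(angle):
        # Shift to [0, 360) so that -180 falls in bin 0; +180 wraps to bin 0.
        return int(((angle + 180.0) % 360.0) // bin_width)

    return bin_index(phi) * bins_per_axis + bin_index(psi)


print(phi_psi_class(-57.0, -47.0))  # a typical alpha-helical (phi, psi) pair
```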
34
Backbone fragment alphabets
Classes based on clustering low-level
properties of contiguous series of residues.
Unger et al. 1987
~100 6-residue fragments
k-nearest neighbor clustering
by RMSD of Cα atoms
Centroid of each cluster
selected as building block
Unger et al. 1987
35
Backbone fragment alphabets
De Brevern et al. 2000
Protein Building Blocks
(PBBs).
16 classes of 5-residue
fragments.
SOM clustering of vectors of
8 dihedral angles (φ and ψ).
De Brevern et al. 2000
36
Desired properties of local
structural alphabets
For purposes of improving fold
recognition:
– Predictable from primary sequence
– Conserved within a fold family
37
Comparison of existing local
structure alphabets
Only a few of the alphabets have been
tested for predictability.
None of the alphabets have been tested for
conservation within fold families.
38
Designing a Local Structure
Alphabet
Extract properties with respect to each
residue in the dataset.
Selected PDB
structures
Property
extraction
Selected
property:
TCO
i-1
i
PDBNo  AA  TCO
1      M   -0.3
2      L   -0.34
3      S   0.91
4      P   0.935
5      E   -0.1
6      V   0.2
.
.
39
Designing a Local Structure
Alphabet
Partition the data into k populations.
PDBNo  AA  TCO
1      M   -0.3
2      L   -0.34
3      S   0.91
4      P   0.935
5      E   -0.1
6      V   0.2
.
.
Unsupervised Learning Algorithm
Class A
PDBNo  AA  TCO
1      M   -0.3
2      L   -0.34
5      E   -0.1
Class B
PDBNo  AA  TCO
3      S   0.91
4      P   0.935
6      V   0.2
[Figure: data points plotted on a TCO axis from -1 to 1, labeled Class A and Class B]
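The slide leaves the unsupervised learning algorithm unspecified. As one possible stand-in, here is a minimal one-dimensional k-means (Lloyd's algorithm) applied to the six TCO values from the table; the choice of k-means, the evenly spaced initial centroids and the function name are all assumptions.

```python
def kmeans_1d(values, k=2, iterations=100):
    """Lloyd's algorithm on scalars; initial centroids evenly span the data range."""
    lo, hi = min(values), max(values)
    if k > 1:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        centroids = [sum(values) / len(values)]
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


tco = [-0.3, -0.34, 0.91, 0.935, -0.1, 0.2]
labels, centroids = kmeans_1d(tco)
print("".join("AB"[lab] for lab in labels))
```

On these six values the sketch places residue 6 (TCO 0.2) in the lower cluster, whereas the slide's illustration puts it in Class B: the resulting alphabet depends on the algorithm and its initialization.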
40
Designing a Local Structure
Alphabet
Selected property:
KJ descriptor vector*:
[ζ, τ, d1, d2, d3]
ZETA: residues i-1, i
TAU: residues i-1, i, i+1
D1 dison3: H-bond length from O(i) to N(i+3)
D2 dison4: H-bond length from O(i) to N(i+4)
D3 discn3: length from C(i) to N(i+3)
* Descriptor vector of key geometric properties
identified by King and Johnson 1999
41
Designing a Local Structure
Alphabet
Extract properties with respect to each
residue in the dataset.
Selected PDB
structures
Property
extraction
Selected property:
KJ descriptor vector:
[ζ, τ, d1, d2, d3]
PDBNo  AA  KJDV
1      M   [13.6, 92.9, 3.7, 3.1, 4.1]
2      L   [14.4, 95.7, 4.9, 7.1, 4.9]
3      S   [19.8, 100.3, 7.2, 10.1, 6.9]
4      P   [18.1, 116.2, 6.7, 9.2, 6.9]
.
.
.
42
Designing a Local Structure
Alphabet
Clustering multi-dimensional data points.
PDBNo  AA  KJDV
1      M   [13.6, 92.9, 3.7, 3.1, 4.1]
2      L   [14.4, 95.7, 4.9, 7.1, 4.9]
3      S   [19.8, 100.3, 7.2, 10.1, 6.9]
4      P   [18.1, 116.2, 6.7, 9.2, 6.9]
.
.
.
Components are in different units. Scale to the
same range?
Very high-dimensional vectors require
feature reduction.
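One way to put angle and distance components on the same footing before clustering is min-max scaling of each component to [0, 1]. The slide only raises scaling as a question, so this is one option among several (z-scores are another); the function name and the example vectors are illustrative.

```python
def min_max_scale(vectors):
    """Rescale each component (column) of the vectors to the range [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [[(x - lo) / span if span else 0.0  # constant column maps to 0.0
             for x, lo, span in zip(vec, lows, spans)]
            for vec in vectors]


# Illustrative five-component vectors: two angles in degrees, three distances.
vectors = [
    [13.6, 92.9, 3.7, 3.1, 4.1],
    [19.8, 100.3, 7.2, 10.1, 6.9],
    [18.1, 116.2, 6.7, 9.2, 6.9],
]
for row in min_max_scale(vectors):
    print([round(x, 2) for x in row])
```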
43
Evaluation protocol
Protocol is based on:
– testing candidate alphabets for their
conservation within fold families.
– testing predictability of candidate
alphabets
– testing improvements in fold
recognition when candidate alphabets
are used.
44
Evaluation Protocol: string
translation
>2abd
MDAAVKTG
>4eca
MELVIRSG
. . .
Selected
alphabet
Selected PDB
structures
Stringbuilder
>2abd
CAAABCAB
>4eca
ACBBABCA
. . .
Position-equivalent
strings in
new
alphabet
45
Evaluation Protocol: alignment
translation
MD-AAVKTG
ME-LVIRSG
M-SAGCRDK
MEA-SC-E-
Position-equivalent
strings in
new alphabet
Fold family
alignments
Alignment
builder
CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-
Position-equivalent
alignments
in new
alphabet
46
Evaluation Protocol: alphabet
conservation
Position-equivalent
alignments
in new
alphabet
Conserved?
Average entropy in columns of
alignments.
Relative entropy of substitution
matrix constructed from
alignments (Altschul 1991).
CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-
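The first conservation measure, average entropy in alignment columns, can be sketched as follows. Lower average entropy means the letters in a column agree more often, i.e. the alphabet is better conserved. Ignoring gap characters and using base-2 logarithms are assumptions; the alignment is the one shown on the slide.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the letters in one column, gaps ignored."""
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def average_entropy(alignment):
    """Mean column entropy over all columns of an alignment."""
    columns = list(zip(*alignment))
    return sum(column_entropy(col) for col in columns) / len(columns)


alignment = ["CA-AABCAB", "AC-BBABCA", "C-AACCBBC", "CCA-BB-A-"]
print(average_entropy(alignment))
```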
47
Evaluation Protocol: alphabet
predictability
P(A)
Test predictability
with Predict_2nd
neural net.
Improve on
neural net
performance with
alternate
methods.
P(B)
P(C)
Courtesy of C. Barrett
Position-equivalent
strings in
new alphabet
Predictable?
48
Evaluation Protocol: fold
recognition
Build a fold library that incorporates the
local structure alphabet and do fold
recognition testing using this library.
49
Incorporating local structure
alphabets into a fold library
Simplest. Use predicted local structure
string for target and known local structure
string for templates.
Template Fold library
Target local structure string
ABBCACAB
PROBLEM!
Wrong letter predicted.
Template local structure string
Target/template Score:
CCABBBAC
AACBCAA
7
CAACBBB
21
2
50
Incorporating local structure
information into a fold library
Use several strings (amino acid and local
structure) for target and templates.
Target with string tuple
YLAADTYK
ABBCACAB
WYTZTTVU
Template Fold library
PROBLEM!
Wrong letters predicted.
Template with string tuples
Target/template Score:
FISSETCN
CCABBBAC
YVUUTZVV
6
MEPSSYV
AACBCAA
TTYUVWZ
23
TGLIRKN
CAACBBB
YUUUVZW
5
51
Extending the SAM-T2K method
with local structure information
Add tracks to the target HMM.
Search template library of sequence tuples
with multi-track target HMM.
Target multi-track HMM
Template with sequence tuples
Target/template Score:
Template Fold library
FISSETCN
CCABBBAC
YVUUTZVV
75
MEPSSYV
AACBCAA
TTYUVWZ
TGLIRKN
CAACBBB
YUUUVZW
3
22
52
Extending the SAM-T2K method
with local structure information
Adding local structure strings to the
template HMM. Enable 2-way HMM scoring.
     Y     L     A     A     D     T     Y     K
A    0.65  0.15  0.01  0.47  0.85  0.32  0.81  0.5
B    0.2   0.7   0.04  0.45  0.1   0.18  0.09  0.25
C    0.15  0.25  0.9   0.08  0.05  0.5   0.1   0.15
Target
Template Fold library
YLAADTYK
ABBCACAB
WYTZTTVU
Template amino acid HMMs
plus local structure strings
Target/template Score:
CCABBBAC
YVUUTZVV
8
AACBCAA
TTYUVWZ
CAACBBB
YUUUVZW
24
49
53
Extending the SAM-T2K method
with local structure information
Build multi-track HMMs for target and
template.
Target multi-track HMM
Template Fold library
Template multi-track HMMs
Target/template Score:
6
23
5
54
Evaluation Protocol: fold
recognition
Fold
classification
database
Non-redundant
Fold test
set
119l   T4 Lysozyme
12asA  Asparagine Synthetase
153l   Goose Lysozyme
16pk   Phosphoglycerate Kinase
16vpA  VP16 regulatory protein
. . .
Template Fold library
Target
119l
119l
12asA
153l
16pk
16vpA
. . .
Templates:
12asA
153l
16pk
Target/template Score:
12
2
71
55
Evaluation Protocol: fold
recognition
[Figure: False Positives (ticks 1 to 2000) plotted against True Positives (ticks 500 to 10000); + = same fold.
Methods compared: old PSI-blast, PSI-blast, SAM-T2K, SAM-T2K EHL 50-50, SAM-T2K EBGHTL 50-50, DALI.
Courtesy of K. Karplus]
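A true-positive versus false-positive comparison of this kind can be produced by ranking all target/template pairs by score and counting same-fold and different-fold pairs down the list. The sketch below assumes that higher scores are better and that each pair carries a same-fold label; the function name and example labels are hypothetical.

```python
def tp_fp_curve(scored_pairs):
    """Cumulative (true positives, false positives) down the ranked list.

    scored_pairs: iterable of (score, same_fold) with higher score = better.
    """
    tp = fp = 0
    curve = []
    for _score, same_fold in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if same_fold:
            tp += 1
        else:
            fp += 1
        curve.append((tp, fp))
    return curve


# Hypothetical scores and fold labels for three target/template pairs.
print(tp_fp_curve([(71, True), (12, False), (2, True)]))
```

A method is better when it reaches more true positives before accumulating a given number of false positives.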
56
Research Schedule
Year 1:
Find a local structure alphabet that
improves fold recognition. Build a fold
library that uses the alphabet. Put up a
webserver for public use of the library.
Summer 2002
CASP5
57
Research Schedule
Year 2:
Design more alphabets. Compare and
combine new and existing alphabets.
Expand the methods to continuous-value
predictions. Incorporate best combination
into my fold library.
June 2003
Produce completed dissertation.
58
Conclusion
Focus of the work:
– Evaluate existing local structure alphabets
– Design and evaluate novel local structure
alphabets
Evaluation protocol:
– conservation
– predictability
– fold recognition