Transcript downloading

Design of score functions for
recognition of protein folds
T. Galor
Joint work with
Ron Elber
Thorsten Joachims
Structures are vital to understand
protein function
Researches are interested
to know:
• what are the active
The HIV protease plays a major
role in cell infection.
site residues for
enzymatic reaction?
• Where are the active
sites for transmission of
signal?
There is a need for a
tool for finding
Protein structure
It is important for drug design to
know the binding sites.
From Homologues to structure
(evolutionary related protein)
We present procedures for identifying Homologues
structures from a given library of structures that
span the Protein Data Bank.
• An annotated homologues structure may give a
clue for the function of the probe protein
• The homologues structure may be used as scaffold
for modeling the probe structure.
Sequence alignment
S1: AWHFFAI
A
A
W H
- H
S2: AHGI
F
G
F
-
A
-
I
I
Only sequence information is
used ,both in the query and target
Some Homologues do not share
high sequence similarity
Myoglobin 1mba and
leghemoglobin 1bin:A share
similar structure but their
sequence identity is small
14%.
When sequence similarity fails
we can use different similarity
measure- Threading
Fold prediction by threading
when sequence identity is low
Evaluate how amino acid
sequence fits into known
three-dimensional (3D)
protein structure.
Test fitness of the probe
sequence to structures
from existing library.
Choose the most significant
fit from sequence/structure
matches.
MGFPIPDPYV … KGKI
Definition of protein characteristics
The number of contacts determines if
a site is buried or exposed
Characterize a protein is by its amino acid
sequence S,
S: AW….HI
Or by its structure X,
W
A
H
I
X: s1s0….s2s2
A structure site can be classified according
to the number of contacts with its
neighbors. Polar sites have law number of
contacts. Hydrophobic sites have more.
Focus on threading
S1: AWGHKI
Sequence information is used for
the probe protein. Structural
information for the target.
A W - G H K - I
s1 - s3 - s1 s 1 s5 s 1
G
H
K
I
X2: s1s0s2s3s0s3
Many ways to align two proteins
S1: AWHFFAI
S2: AHGI
A W H F - F A I
s1 s1 s2 s3 s3 s3 s4 s3
There are many
different alignments
for two sequence:
2
A W - H F F A I
s1 s1 s2 s3 s3 s3 s4 s3
nm
m the length of S1
,
n the length of S2
The need to score the alignments
Counts the number of amino acid type ai
placed at site type sj
E  S1  S 2,W  

 x  a , s    x   s 
20
i , j 1
W    ai , s j  ,   s j 
A H H -
15
i, j
i

W A -
s1 s1 s2 s2 s1 s1 s2
j
j 1
j
Costs
j
Counts the
number of gaps
placed at site sj
The cost matrix W
A numerical value is
assigned to every cell
depending on the fitness
of assigning an amino
acid into structural site.
These may be simple
scores or more
complicated. Related to
chemical similarities or
frequency observed
S1
s2
s3
s4
s5
s6
s7
s8
s9
A
R
N
D
C
Q
E
G
H
-0.02
0.1
-0.22
0.02
-0.13
0.02
0.05
-0.05
-0.17
-0.06
-0.23
-0.07
0.020
-0.37
0.21
-0.03
-0.06
-0.05
-0.02
-0.01
-.01
0.34
0.72
0.09
0.1
0.05
-0.25
-0.17
0.12
0.29
0.37
-0.70
0.22
0.4
0.14
-0.31
-0.13
0.22
0.2
0.68
-1.13
0.33
0.45
0.38
0.24
0.02
0.32
0.17
0.43
-1.16
0.02
0.7
0.42
0.36
0.12
-0.1
0.3
0.43
-1.27
0.46
0.39
0.2
0.27
-0.07
0.91
-0.12
-0.01
-1.6
0.5
0.83
0.29
-0.71
0.83
1.36
0.11
0.35
-1.71
0.82
10
2.12
3.38
How to find the optimal
alignment
S2: MPR X2: s1s1s3
Dynamic programming
algorithm is an O(nm)
algorithm which find the
optimal alignment given a
cost matrix and gap cost
from an exponential number
(2^(n+m)) alignments.
S1: PVRC
P
V
R
C
P
V
R
C
-
-
s1
s2
-
s3
s1
s1
s3
Present alignment scores are not
accurate enough
There is the need for
• A better scoring cost.
• A better gap cost.
Or
• Accompany the similarity measure
(Seq2Seq/Seq2Struc) with a statistical
measure which will enhance the signal.
The Z score measure
• Define a set of random sequences. Each random
sequence is threaded to the same target structure.
• We compare the alignment cost of the true probe
sequence with the average cost of random
sequences.
• This enables us to estimate the significance of the
score of the probe compared to a typical score of a
random sequence.
A novel method for designing
energy cost function parameters
Current methods try to
recognize native versus
decoys.
E  N  D,W   E  N  N ,W   0
But the goal is to identify Homologues
The answer is
DIFFICULTIES
Not exact
Exponential number
of alignments
E  N  D,W   E  N  H ,W   0
Recognize Homologues versus decoys
The steps to design energy cost
function parameters
E  S  X ,W 
An energy function
T  Si , H i D i i 1
n
A training set
An algorithm to train
proteins to recognize
their correct folds
Requirments:
E  S  D, W   E  S  H ,W   0
An algorithm to solve the
“training” conditions.
Evaluation of the new score
parameters.
Use Mathematical
programming to solve the
set of equations
Evaluate the performance on
an independent set.
Some notation
alignment
An alignment of protein
SI into protein SII results
with a path g.
Recall that there are
many possibilities of
paths (2^(n+m))
 a1I a2I a3I anI 1anII
g   II
II
II
a

a

a
2
nII
 1
g  3 g  l 
The set of
alignments
  S I  S II  .
The cost matrix definition
The total energy, denoted Etotal, of
the alignment is used as a measure
to score the similarity between ,the
two proteins


l
Etotal S1  S2   E g  k  
g
k 1
The cost function
Our cost function has the
form:


score
Etotal N  X , w   xi , j  ai , s j    x j L  s j 
g
20 15
15
i 1 j 1
j 1
 xg , w .
alignment

 
Alignment coefficient

E g , w   E S  D, w  E S * H , w  x, w
g
g
HIDE SLIDE
Optimize the parameter W such that
the native energy is the lowest
Instead of solving for the
unknown optimal path, we
solve for all paths.
E g , w   0, g  1
E g , w   0, g   n
Unfortunately the number of inequality is
exponentially large.
Use statistical machine learning
algorithm
E  x, w  0  cos(x, w)  0    90o
w2
w2
constrain
X
w
Find the
middle
point in
the cone
w1

w
w1
I=0
Algorithm
Converge in
polynomial time
wi
Compute the optimal
alignment using dynamic
programming.
I=I+1
NO
Solve E>0
wi+1
|| wi -wi+1||<0.001
yes
THE END
Toy Example-New method
sequence alignment
E
Number of sets in Training
1169
Number of error 225
80
Number of iteration 6
Native: 1chg, length 245.
Chymotrypsinogen A
20
Homologue: 2alp
Sets
number
  gap   1.0911
  ai , a j   BLOSUM50
Toy Example-Old method
E
Number of sets in Training
1169
Number of error 725
Native: 1chg, length 245.
700
Chymotrypsinogen A
Homologue: 2alp
  gap   8
100
  ai , a j   BLOSUM50
Set label
LOOPP
http://ser-loopp.tc.cornell.edu/loopp.html
Query sequence
Cost model
Compute the cost of the
alignment of the Query to
proteins in LOOPP database.
Compute the Z score for a
subset of proteins with highest
scores
List of homologue
targets+alignment
Select the best prediction
using a number of scores
DESIGN SCORE PARAMETERE
Master node
Node 1
LOOPP:
NODE n
E g 1   x g 
LOOPP: E g    x g 
n
SVM
Analyze with
PERL script
SUMMARY
• We introduce a consistent method for
designing cost function parameters using
threading which enable also to design gaps
parameters.
• We overcome the problem of solving
exponentially number of inequalities using
iterative SVM formulation.
Tools to determine protein folds
• The number of new protein sequences is growing
exponentially relative to the number of protein
structures being solved by experimental methods.
• There is a need for proteins annotation tool using
sequence, and structural information.
• The method needs to be quick.
• And to give reliable answers.