Transcript Slide 1

Alignment of Flexible Protein
Structures
Based on:
FlexProt: Alignment of Flexible Protein Structures
Without a Pre-definition of Hinge Regions /
M. Shatsky, R. Nussinov, H. Wolfson
Presented by: Einat Engel
1
Introduction
Proteins are flexible structures
2
Outline
Introduction
• Proteins (reminder)
• Protein motion
• Structural alignment – rigid & flexible
General Description:
• Problem’s description
• Discussion
3
Outline
Detailed Description:
• FPSA problem description
• FlexProt algorithm for the FPSA problem
• Experimental results
• Heuristic Algorithm for FPSA
• Clustering
Conclusions & Discussion:
• Summary of algorithm
• Major results
• Discussion
4
Reminder: Protein Structure
Proteins are made up of 20 different amino acids
(or "residues").
Different levels of protein structures:
• Primary – amino acid sequence
• Secondary – local folding of amino acid chains
• Tertiary – 3D structure of a protein
• Quaternary – forming multi-chained proteins
5
Reminder: Protein Structure
Primary structure
Tertiary structure
6
lysozyme
Flexibility & Protein Motion
• Proteins are flexible molecules that
undergo significant structural changes as
part of their normal function.
• Motion often serves as an essential link
between structure and function.
7
Flexibility & Protein Motion
• Protein motions are involved in numerous
basic functions. In fact, highly mobile
proteins have been implicated in a number
of diseases, e.g., the motion of gp41 in
AIDS
8
Structural Alignment
• When flexible molecules are compared to
each other as rigid bodies, even strong
similarities can be missed
• Yet, most existing protein alignment
algorithms treat them as rigid objects
• We’ll see a technique for the alignment of
flexible proteins
9
The Goal
10
Go back
Existing Approaches – Rigid Structural
Alignment
• Exhaustive 3D search – search all
possible rotations. (Matthews & Rossman)
• Fragment alignment – comparison of
contiguous fragments.
• Geometric Hashing – Local reference
frame, preprocessing & recognition (Fischer)
• Curve Matching – match curves using
Fourier Transform (Schwartz & Sharir)
11
Existing Approaches – Flexible Structural
Alignment
• Domain detection – requires a-priori
knowledge of the corresponding pairs of
amino-acid residues (Wriggers & Schulten)
• Geometric hashing – requires a-priori
knowledge of the hinge location (Verbitsky)
• Data base screening – requires a-priori
knowledge of hinges (Rigoutsos)
12
Outline
Introduction
• Proteins (reminder)
• Protein motion
• Structural alignment – rigid & flexible
General Description:
• Problem’s description
• Discussion
13
Terminology
Two fragments are almost congruent (matched) if:
1. Their sequence length is the same.
2. There exists a 3D rotation and translation
which superimposes the corresponding atoms
with small RMSD.
(Reminder: RMSD measures alignment error.)
14
Problem Definition
• Input: two protein molecules M1 and M2.
• Task: divide the two molecules into
fragments of maximal size, such that the
matched fragments will be almost
congruent.
15
Problem Discussion
• The regions between the fragments are
Example
called flexible (hinge) regions.
• We’d like to minimize the number of
flexible regions and maximize the
Conflict!
alignment size
• Our goal is to find a balanced solution
16
Problem Discussion
Consider two different solutions:
I. 3 rigid parts. Total size = 200 atoms
II. 2 rigid parts. Total size = 150 atoms
Q: Which is better?
A: I don’t know. Let’s divide the results
according to the number of rigid parts.
17
Major Results
• Introducing FlexProt, a new technique for
the alignment of flexible proteins.
• Unlike other algorithms, FlexProt does not
require a priori knowledge of the locations
of the flexible, hinge-bending sites
• The pairs of rigid matching fragments and
the flexible regions are detected
simultaneously
18
Outline
Detailed Description:
• FPSA problem description
• FlexProt algorithm for the FPSA problem
• Experimental results
• Heuristic Algorithms for FPSA
• Clustering
Conclusions & Discussion:
• Summary of algorithm
• Major results
• Discussion
19
Flexible Protein Structural Alignment (FPSA)
Input
• Two proteins, M 1   v1 ,
vi , u j 
, vn  , M 2   u1 ,
3
• Threshold error MaxRMSD
• MaxFlexNum parameter
• A weight function w
20
, um 
FPSA Problem Terminology
A rigid fragment pair is defined as:
Fk1 Ft 2  l     vk , vk 1 ,
, ut l 1  
, vk l 1  ,  ut , ut 1 ,
vi  M 1 , u j  M 2
T is a 3D rigid
transformation,
meaning rotation
and translation
and has the following property:
RMSDT  Fk1 Ft 2  l    MaxRMSD
Where RMSDT is defined as follows:
RMSDT  F F
1 2
k t
21
l  

min
T
l 1
i 0
Tut i  vk i
l
2
FPSA Problem Terminology
Let FJ   f1, , fs  be a list of rigid fragment pairs
s
1 2
where fi  Fk Ft li , such that ki  ki 1 , ti  ti 1 i  1, , s 1
i
i
s 1
Let W  FJ    w  fi , fi 1 
s
i 1
w is a weight function that reflects the “goodness”
of linking two rigid fragment pairs.
22
The FPSA Problem
Example:
f1  F Ft1 l1 
1
k1
2
f2  F Ft2 l2 
1
k2
23
2
The FPSA Problem
For Each s, s  2,
, MaxFlexNum
*
Js
detect F such that:
   W  F  , F
*
Js
W F
Remember,
24
FJs   f1,
Js
Js
, f s  is a list of rigid fragment pairs
The FlexProt Algorithm for FPSA
I.
Detection of all rigid fragment pairs,
F F  l 
1 2
k t
that satisfy the MaxRMSD constraint
II. Detection of optimal configurations between
rigid fragment pairs,
25
F 
*
Js
MaxFlexNum
s 2
I. Detect all Rigid Fragment Pairs

1
In order to find all possible pairs, Fk Ft
Iterate over three indices
k  1 n; t  1 m; l  3
2
 l , do:
 k , t, l  where
min  n  k 1, m  t 1
and select the pairs satisfying
RMSDT  Fk1 Ft 2  l    MaxRMSD
26
I. Complexity
Remember, a rigid fragment pair -
F F  l 
1 2
k t
k  1 n; t  1 m; l  3 min  n  k 1, m  t 1
We assume that
nm
• Iterate over  k , t , l 
O  n3 
• Compute RMSD for each triplet – linear in the
detected fragment size (Sharir)
3
4
O
n
O
n
Total complexity    
27
O  n
II. Detect Optimal Configuration
• Now, we have a set of congruent fragment
pairs.
• Let’s find an optimal subset of it. This
subset will describe an alignment of M2
with M1. We’ll use dynamic programming
Dynamic programming – solves optimization
problem by caching subproblem solutions
rather than recomputing them.
28
II. Detect Optimal Configuration
• In General: define a graph
• Vertices represent the rigid fragment pairs
• The directed edges represent flexible
regions connecting the rigid fragment pairs
• A weight function w is applied to the
edges. it reflects the goodness of
connecting two rigidly matched fragment
pairs
29
II. Detect Optimal Configuration

Vertices V  Fi F
1
2
j
 l 
1
i
2
j
A directed edge between F F
is defined if:
l  and F F l
1
i
2
j
1. The fragments are ascending i  i, j  j
2. The gaps between consecutive fragments are
limited by MaxGap1 and MaxGap2 (user defined)
Gap1  i  i  l   MaxGap1
Gap2   j   j  l   MaxGap2
30
II. Detect Optimal Configuration
M1   v1, v2
v7 
M2  u1, u2
u8 
A  F F  2   v1, v2  , u1, u2 
B  F F 3   v3 , v4 , v5  , u4 , u5 , u6 
C  F F  2   v6 , v7  , u7 , u8 
1 2
1 1
1 2
3 4
1 2
6 7
Define:
MaxGap1=3
MaxGap2=3
C
A
B
31
II. Detect Optimal Configuration
The weight function (smaller is better):
w  e      l  1      max  Gap1 , Gap2   Gap1  Gap2
2
A
B
C
Δ is half of the maximal overlapping interval
1 2
F
 Part A rewards quadratically the size of  i F j  l 
 Part B punishes large gaps
 Part C punishes difference between Gap1 and Gap2
32
II. Detect Optimal Configuration
A  F F  2   v1, v2  , u1, u2 
B  F F 3   v3 , v4 , v5  , u4 , u5 , u6 
C  F F  2   v6 , v7  , u7 , u8 
1 2
1 1
1 2
3 4
1 2
6 7
w  e      l  1      max  Gap1 , Gap2   Gap1  Gap2
2
A
B
C
w  e1     3  0   max  0,1  0  1  7
2
w  e2   16
e2
A
33
C
e1
B
II. Detect Optimal Configuration
• We built a weighted directed acyclic graph
(DAG)
• Shortest weighted paths correspond to
alignments of consecutive, long, congruent
matching fragments.
Almost
Finished
34
Reminder: Shortest Paths in DAGs
First, we perform a topological sort of the Directed
Acyclic Graph (DAG).
Then, we make just one pass over the vertices
according to their order. For each vertex, we relax
each edge that leaves it.
1
6
0
2
∞
2
7
∞
6
4
35
-1
∞
6
5
-2
2
∞
4
3
II. Detect Optimal Configuration
• We run the Shortest Paths in DAGs algorithm.
• A simple case (no limit on the number of nodes
in the shortest path): The shortest path in G
corresponds to a minimal weighted sequence of
rigid fragment pairs, F**, such that
 
W  F **  W FJ s FJ s
• Complexity -
36
OV  E 
II. Detect Optimal Configuration
• We’ll make a small change in the algorithm since
we need to detect shortest paths with exactly s
nodes, s  2, , MaxFlexNum
• In the simple case, each node holds a pointer to
a preceding node with the shortest path.
• Instead, each node will hold MaxFlexNum
pointers. Pointer s points to a preceding node
with a shortest path of size s-1
37
II. Complexity
During Relaxation, we check all MaxFlexNum
possibilities and therefore the complexity is
  V  MaxFlexNum * E 
• The number of nodes in the graph can be
proportional to O n 3
• Graph of n vertices has O n 2 edges
 
 
Total complexity of stage II: O  n 
6
38
Summary of FlexProt Algorithm
• Theoretical worst case complexity is O  n 
• In practice – FlexProt is highly efficient
(with some changes)
• The average running time is approximately
seven seconds (for molecules of 300
amino acids)
So… What
6
does it look
like??
39
Experimental
Results
Experimental
Results
40
Experimental Results
41
Running FlexProt
• http://bioinfo3d.math.tau.ac.il/FlexProt
• http://www.umass.edu/microbio/chime/expl
orer/pe.htm
42
Heuristic Improvement of Step I
• In step I, we detected all of the rigid
fragment pairs. Time complexity – O  n 3 
• The procedure takes several minutes,
even for small proteins.
• Instead, we can use a greedy algorithm,
that only takes O  n 2 
43
Heuristic Improvement of Step I
• Start by aligning a single matching atom
pair  va , ub  where va  M1 and ub  M 2
• Iteratively, add one matching atom pair to
the left and one to the right.
• Stop when we exceed the RMSD
threshold – when the list can’t be extended
to the left or the right.
44
Heuristic Improvement of Step I
After the extension process, we have a match-list
v , u  , , va , ub  , , v
i
 vi ,
j
i l 1
, u j l 1 
, vi l 1  is almost congruent to  u j ,
The next alignment is
initiated at vi  l 1 , u j  l 1

and not  va1 , ub1 

a-1
j+l-1
a+1
b-1
j
i
45
, u j l 1 
a
b
i+l-1
Complexity
Updating the RMSD at each step is O 1
Thus, finding a particular F F  l 
is linear in the length of the fragments - O  l 
1
i
2
j
The time complexity, T1* is:
T 
*
1
46

Fi1F j2  l 
1
Time to compute Fi F
2
j
l 
Complexity
T1* 

Fi1F j2  l 
O l  


 vi , u j 
ij
* O  n   n 2 * O  n   O  n3 
Theoretically, some atom pairs  vi , u j , might
participate in at most n fragment pairs.
In practice, a pair of atoms  vi , u j  participates
in at most 2 fragment pairs.
47
T O n
*
1
2

There are
O(n2) rigid
fragment
pairs
Clustering
• This stage can be viewed as an extension
to the FPSA problem.
• The algorithm clusters consecutive
fragment pairs, that have a similar 3D
transformation, even if they are not directly
linked.
48
Clustering
Example: Two β-strands (A and B) are connected
by loops of different lengths.
Stage I of the FlexProt
algorithm aligns each separately.
A and B have almost the same 3D rigid
transformation and in the clustering stage, they
are joined into one structure.
49
The Clustering Algorithm
• We take each path detected in stage II.
Remember, vertices = congruent fragment pair
• The first vertex is a singleton cluster
• Take the second vertex. Check if there is a rigid
transformation which superimposes both
fragments.
 If successful – do the same for the next vertex
 Else – start a new cluster with the vertex that
failed to join the previous cluster
50
Clustering Complexity
• The number of iterations equals the
number of rigid fragment pairs in the
flexible alignment solution.
• Time complexity: T3  O  MaxNumFlex * N flex 
• N flex is the number of flexible alignments.
2
It is bounded by O  MaxNumFlex * n 
• T3  O  MaxNumFlex 2 * n 2 
n vertices
T3  O  n
51
2

2
Running FlexProt
• http://bioinfo3d.math.tau.ac.il/FlexProt
• http://www.umass.edu/microbio/chime/expl
orer/pe.htm
52
Outline
Detailed Description:
• FPSA problem description
• FlexProt algorithm for the FPSA problem
• Experimental results
• Heuristic Algorithms for FPSA
• Clustering
Conclusions & Discussion:
• Summary of algorithm
• Major results
• Discussion
53
Summary of Algorithm
Exact solution of FPSA
O(n )
Step I (detection of fragment pairs)
Step II (detection of optimal configuration)
Heuristic solution
O(n )
O(n
)
Step I
Step II
O(n )
3
6
2
4
Clustering
54
O(n2)
Major Results
• Unlike other algorithms, FlexProt does not
require a priori knowledge of the locations
of the flexible, hinge-bending sites
• The pairs of rigid matching fragments and
the flexible regions are detected
simultaneously
• The speed of the method allows extensive
database comparison.
55
Significance of Results
• Proteins are flexible. They may appear in
different conformations. FlexProt
incorporates flexibility in structure
comparison.
• Proteins function through binding.
Flexibility is one of the characters of
binding sites. So, it is important to detect
hinge-bending sites.
56
Significance of Results
• Comparing proteins despite the motion
that they have undergone is helpful for
protein classification
• These comparisons are also useful in drug
design, detecting binding sites, and the
range of motions that proteins display
57
FlexProt – Discussion
• Differs from other flexible alignment
algorithms (and of course, rigid)
• Does not violate the protein sequence
order
• Given two alignments, each giving better
results in different measures, which is
better?
• Clustering is optional.
• Which proteins are compared?
58
Websites
PDB –
http://www.rcsb.org/pdb/
SCOP –
http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.d.bbh.b.b.be.html
Database of motions –
http://molmovdb.mbb.yale.edu/molmovdb/
59
Bibliography
• Cormen, “Introduction to Algorithms”, chapter 24, Single
Shortest Paths
• Gerstein, Database of molecular Movement
• Shatsky, Nussinov and Wolfson, “Flexible Protein
Alignment and Hinge Detection”, Proteins: Structure,
function and genetics: 48, 242-256 (2002)
• Wolfson, a “Structural Bioinformatics – 2003”
presentation
60