Transcript SB1_3_Siu

A data-mining approach for multiple
structural alignment of proteins
WY Siu, N Mamoulis, SM Yiu, HL Chan
The University of Hong Kong
Sep 9, 2009
Aligning protein structures


Important step for understanding protein
functions
Sequencing proteins and determining
3D structures is easy



Testing functions of proteins is hard
One useful observation




X-ray crystallography, NMR spectroscopy
Mutations change sequences
Structures conserved
Structural similarity => Functional similarity
Good structural alignment algorithm =>
Predict functions of proteins
2
Our focus


We propose studying the problem with less
information or assumptions
Sequence order independence



Subset alignment



Sequence order (arrangement of amino acids) is
unknown
Reduce the need to find sequence information
Find large alignment for all subsets
Extract similar structures from a mixture with dissimilar
ones
Bottleneck metric

In an alignment, every pair of aligned points have a
small distance
3
Related work

Pairwise alignment




[Holm & Sander 93]
[Gibrat, Madej, & Bryant 96]
[Shindyalov & Bourne 98]
Techniques to obtain multiple alignments



Dali
VAST
CE
center-star
[Akutsu & Kim 99]
Tree-progress [Taylor, Flores, & Orengo 94]
Multiple alignment





MultiProt
MultiBind
MUSTA
MASS
POSA
[Shatsky et. al. 02]
[Shatsky et. al. 06]
[Leibowitz et. al. 01]
[Dror et. al. 03]
[Ye & Godzik 05]
(seq. order)
(all align.)
(all align.)
(seq. order)
(seq. order)
4
In the followings



Model
Algorithms SOIL
Experimental results
5
Model

A protein is a set of amino acid in 3D,
and an amino acid = 3 points in space





For Cα-atom, C, N
Substructure = subset of amino acid
For each s  S, T(s) = Rs + t, where
R is a 3 × 3 rotation matrix
t is a 3 × 1 translation matrix
Rotate
Similarity



S2
Transformation T(S)


S1
C = {c1, …, cn} a set of substructures, T
= {T1, … Tn} be a set of transformation
C is ε-congruent w.r.t. T if we can
transform each structures in C and align
the amino acids such that the Cα items
of every aligned pairs are close (<=ε)
Translate
A ε-congruent alignment

For a set of S of structures, an alignment is
set of substructures C and transformations T
6
Problem definition



Size of an alignment: number of aligned amino
acid or each protein
Cardinality: number of structures involebed.
Input





A set of structures S = {S1, S2, …, Sm}
A distance threshold 
A subset size threshold min_cardinality
An alignment length threshold min_size
Output

For each subset S’  S with |S’|  min_size, the maximal
length –congruent alignment whose length is at least
min_length
7
The SOIL Algorithm

Sequence Order Independent aLignment



Step 1. Geometric hashing
Step 2. Frequent pattern mining
Step 3. Generating alignments
8
Geometric Hashing

Purpose

Take each amino acid as a base (reference) and store
the relative location of other amino acids in a hashtable.
S2
S1
1
2
1
3
2
4
3
5
5
4


Store the base
Length of box = ε
9
Mining Frequent Patterns



Main observation. Assume that a pair of bases
{(k1, i1) {k2, i2)} appears in x boxes. Then if
structures Sk1 and Sk2 are transformed using the
bases for Sk1i1 and Sk1i2, there are at least x+1
pairs of points locating closely with each other
(distance at most √3ε, i.e., diagonal length).
Proof. Why (k1,i1) is in a
box?
When Sk1 is transformed
using the base Sk1i1, an
amino acid locates at that
box
10
Mining Frequent Patterns






Let each hashbox be a coincidence group, or
transaction.
Consider all bases as items
Find all sets of items that appear frequently in
the coincidence group.
“Frequent pattern mining
problem”, a well-studied
problem in database area.
Efficient algorithms, like
fp-tree, are known
Efficient, can consider all
possible transformations
at the same time
11
Generating Alignments

Given a frequent pattern




E.g., (S12, S21)
Use the bases in a tuple to transform the structures
involved
Generate a matching of points, bipartite matching for
pairwise, greedy for multiple
Output the largest alignment
Transformed S1 and S3
Alignment
y
S 11
S 43
S
1
5
S
3
1
S 23
S
1
4
S 31
S 53
S 21
S
3
3
x
S1
S3
5
1
1
5
2
4
3
3
4
2
12
Experimental evaluation
Implemented in C++
 Test cases run on Intel ® CoreTM 2 Duo
with 2.66GHz CPU and 4GB main memory
 Default settings






: 3Å
min_size: 2
LRF: 3Atoms
Coincidence group: Bin
max_trans: 30Avg
13
Pairwise alignment

Length

10 pairs of proteins used before, e.g., MultiProt
SCOP and PRINT families
250
200
150
100
50
0
C-alpha match
MultiProt
MultiBind
SOIL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Test cases

Comparison of running time




C-alpha match: within a few seconds (from web)
MultiProt: 0.211s
MultiBind: 1.968s
SOIL: 0.235s
14
Multiple alignments


10 groups of proteins
Various superfamilies in SCOP, protein interfaces
from PRINT
Multiple alignments
Comparison of Lengths
200
MultiProt
MultiBind
150
SOIL
100
50
0
6 5 4 3 2
Calcium Binding
4 3 2
4-helix Bundle
5
4 3 2
3 2
10 9
Superhelix Supersandwich
8 7 6 5 4
Concanavalin
3
2
(Levels)
Comparison of Lengths (Cond't)
400
300
MultiProt
MultiBind
SOIL
200
100
0
4 3 2
6
tRNA synthetase
5 4 3 2
G-proteins
5 4 3 2
PTB domain
3 2
PRINT 45
3 2
(Levels)
PRINT 8158
16
Multiple alignment
s
Comparison of Running Time
10000
1000
100
10
1
0.1
0.01
MultiProt
MultiBind
SOIL
1
2
3
4
5
6
7
8
9
10
Test cases
17
Conclusion

Proposed a more difficult problem

Sequence order independence


Subset alignment


Adopt the bottleneck metric
Developed the SOIL algorithm



Automatically detect subsets of similar structures
Similarity measurement


Modeled as the largest common point set problem
Combination of Geometric Hashing and Frequent
Itemset Mining
Simultaneous alignment
Evaluated the algorithm with experiments

Can be combined with other methods by simply taking
the maximum.
18