Transcript SB1_3_Siu
A data-mining approach for multiple
structural alignment of proteins
WY Siu, N Mamoulis, SM Yiu, HL Chan
The University of Hong Kong
Sep 9, 2009
Aligning protein structures
Important step for understanding protein
functions
Sequencing proteins and determining
3D structures is easy
Testing functions of proteins is hard
One useful observation
X-ray crystallography, NMR spectroscopy
Mutations change sequences
Structures conserved
Structural similarity => Functional similarity
Good structural alignment algorithm =>
Predict functions of proteins
2
Our focus
We propose studying the problem with less
information or assumptions
Sequence order independence
Subset alignment
Sequence order (arrangement of amino acids) is
unknown
Reduce the need to find sequence information
Find large alignment for all subsets
Extract similar structures from a mixture with dissimilar
ones
Bottleneck metric
In an alignment, every pair of aligned points have a
small distance
3
Related work
Pairwise alignment
[Holm & Sander 93]
[Gibrat, Madej, & Bryant 96]
[Shindyalov & Bourne 98]
Techniques to obtain multiple alignments
Dali
VAST
CE
center-star
[Akutsu & Kim 99]
Tree-progress [Taylor, Flores, & Orengo 94]
Multiple alignment
MultiProt
MultiBind
MUSTA
MASS
POSA
[Shatsky et. al. 02]
[Shatsky et. al. 06]
[Leibowitz et. al. 01]
[Dror et. al. 03]
[Ye & Godzik 05]
(seq. order)
(all align.)
(all align.)
(seq. order)
(seq. order)
4
In the followings
Model
Algorithms SOIL
Experimental results
5
Model
A protein is a set of amino acid in 3D,
and an amino acid = 3 points in space
For Cα-atom, C, N
Substructure = subset of amino acid
For each s S, T(s) = Rs + t, where
R is a 3 × 3 rotation matrix
t is a 3 × 1 translation matrix
Rotate
Similarity
S2
Transformation T(S)
S1
C = {c1, …, cn} a set of substructures, T
= {T1, … Tn} be a set of transformation
C is ε-congruent w.r.t. T if we can
transform each structures in C and align
the amino acids such that the Cα items
of every aligned pairs are close (<=ε)
Translate
A ε-congruent alignment
For a set of S of structures, an alignment is
set of substructures C and transformations T
6
Problem definition
Size of an alignment: number of aligned amino
acid or each protein
Cardinality: number of structures involebed.
Input
A set of structures S = {S1, S2, …, Sm}
A distance threshold
A subset size threshold min_cardinality
An alignment length threshold min_size
Output
For each subset S’ S with |S’| min_size, the maximal
length –congruent alignment whose length is at least
min_length
7
The SOIL Algorithm
Sequence Order Independent aLignment
Step 1. Geometric hashing
Step 2. Frequent pattern mining
Step 3. Generating alignments
8
Geometric Hashing
Purpose
Take each amino acid as a base (reference) and store
the relative location of other amino acids in a hashtable.
S2
S1
1
2
1
3
2
4
3
5
5
4
Store the base
Length of box = ε
9
Mining Frequent Patterns
Main observation. Assume that a pair of bases
{(k1, i1) {k2, i2)} appears in x boxes. Then if
structures Sk1 and Sk2 are transformed using the
bases for Sk1i1 and Sk1i2, there are at least x+1
pairs of points locating closely with each other
(distance at most √3ε, i.e., diagonal length).
Proof. Why (k1,i1) is in a
box?
When Sk1 is transformed
using the base Sk1i1, an
amino acid locates at that
box
10
Mining Frequent Patterns
Let each hashbox be a coincidence group, or
transaction.
Consider all bases as items
Find all sets of items that appear frequently in
the coincidence group.
“Frequent pattern mining
problem”, a well-studied
problem in database area.
Efficient algorithms, like
fp-tree, are known
Efficient, can consider all
possible transformations
at the same time
11
Generating Alignments
Given a frequent pattern
E.g., (S12, S21)
Use the bases in a tuple to transform the structures
involved
Generate a matching of points, bipartite matching for
pairwise, greedy for multiple
Output the largest alignment
Transformed S1 and S3
Alignment
y
S 11
S 43
S
1
5
S
3
1
S 23
S
1
4
S 31
S 53
S 21
S
3
3
x
S1
S3
5
1
1
5
2
4
3
3
4
2
12
Experimental evaluation
Implemented in C++
Test cases run on Intel ® CoreTM 2 Duo
with 2.66GHz CPU and 4GB main memory
Default settings
: 3Å
min_size: 2
LRF: 3Atoms
Coincidence group: Bin
max_trans: 30Avg
13
Pairwise alignment
Length
10 pairs of proteins used before, e.g., MultiProt
SCOP and PRINT families
250
200
150
100
50
0
C-alpha match
MultiProt
MultiBind
SOIL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Test cases
Comparison of running time
C-alpha match: within a few seconds (from web)
MultiProt: 0.211s
MultiBind: 1.968s
SOIL: 0.235s
14
Multiple alignments
10 groups of proteins
Various superfamilies in SCOP, protein interfaces
from PRINT
Multiple alignments
Comparison of Lengths
200
MultiProt
MultiBind
150
SOIL
100
50
0
6 5 4 3 2
Calcium Binding
4 3 2
4-helix Bundle
5
4 3 2
3 2
10 9
Superhelix Supersandwich
8 7 6 5 4
Concanavalin
3
2
(Levels)
Comparison of Lengths (Cond't)
400
300
MultiProt
MultiBind
SOIL
200
100
0
4 3 2
6
tRNA synthetase
5 4 3 2
G-proteins
5 4 3 2
PTB domain
3 2
PRINT 45
3 2
(Levels)
PRINT 8158
16
Multiple alignment
s
Comparison of Running Time
10000
1000
100
10
1
0.1
0.01
MultiProt
MultiBind
SOIL
1
2
3
4
5
6
7
8
9
10
Test cases
17
Conclusion
Proposed a more difficult problem
Sequence order independence
Subset alignment
Adopt the bottleneck metric
Developed the SOIL algorithm
Automatically detect subsets of similar structures
Similarity measurement
Modeled as the largest common point set problem
Combination of Geometric Hashing and Frequent
Itemset Mining
Simultaneous alignment
Evaluated the algorithm with experiments
Can be combined with other methods by simply taking
the maximum.
18