Finding Compact Structural Motifs
Download
Report
Transcript Finding Compact Structural Motifs
Finding Compact Structural
Motifs
Presented By: Xin Gao
Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu,
Ming Li, and Jinbo Xu
University of Waterloo, Ontario, Canada
[email protected]
Outline
Introduction to Structural Motif
Related Work
Compact Motif-finding Problem Formulation
NP-Hard of the Compact Motif-finding Problem
A Polynomial Time Approximate Scheme
Outline
Introduction to Structural Motif
Related Work
Compact Motif-finding Problem Formulation
NP-Hard of the Compact Motif-finding Problem
A Polynomial Time Approximate Scheme
Introduction
Protein is a sequence of amino acids.
A protein always folds into a specific
3-D shape.
Structures are important to proteins:
The functional properties of proteins
depend on their 3-D structures.
Structures are more conserved than
sequence during the evolution of proteins.
Structural Motif
Structural motif is a frequently
occurring substructure of proteins.
Motifs are thought to be tightly
related to protein functions.
Identifying motifs from a set of
proteins can help us to know their
evolutionary history and functions.
Structural Motif Finding Problem
Given a set of protein structures, to find the
frequently occurring substructure.
Informally, to find one substructure from each
protein, that exhibit the highest degree of similarity.
How to measure the similarity of two
substructures?
Two popular measurements:
dRMSD: measure the root mean square Euclidean
distance between the corresponding
residues from different protein structures.
cRMSD: calculate the internal distance matrix for
each protein, and compare the distance
matrices for input structures.
Outline
Introduction to Structural Motif
Related Work
Compact Motif-finding Problem Formulation
NP-Hard of the Compact Motif-finding Problem
A Polynomial Time Approximate Scheme
Related Work
L.P.Chew proposed an iterative algorithm to
compute the conserved shape and proved its
convergence. (2002)
D. Bandyopadhyay applied graph-based datamining tools to find the family-specific fingerprints.
(2006)
M. Shatsky presented an algorithm to uncover the
binding pattern. (2006)
DALI and CE attempt to identify structural alignment
with minimal dRMSD.
STRUCTRAL and TM-Align employ heuristics to
detect the alignment with minimal cRMSD.
Related Work (continued)
However, these methods are all heuristic; the
solutions are not guaranteed to be optimal or near
optimal.
The first PTAS for pairwise structural alignment:
R. Kolodny explored the Lipschitz property of the scoring
function. (2004)
Though this algorithm can be extended to the case
of multiple structure alignment, the simple extension
has a time complexity exponential in the number of
proteins.
Is there a PTAS to multiple structure motif finding?
Outline
Introduction to Structural Motif
Related Work
Compact Motif-finding Problem Formulation
NP-Hard of the Compact Motif-finding Problem
A Polynomial Time Approximate Scheme
We focus on (R, C)-Compact Motif.
What is (R, C)-compact motif?
A motif is bounded in a minimum ball with radius R.
In this ball, at most C residues do not belong to this motif.
(R,C)-compact motif is biologically meaningful since
We focus on globular proteins.
We allows at most C exceptions.
(R, C)-Compact Motif Finding Problem
Input: n protein structures S1 …, Sn, and length l
Output: a consensus consists of l 3D points
Objective:
q=(q1, …, ql )
a substructure ui from each protein Si
min (1 in d2(q, ui))1/2
Here, we adopt the dRMSD distance function, i.e.,
d(q, ui)=min||q- (ui)||2
consists of a rotation and a translation
||*||2 is the Euclidean metric.
Outline
Introduction to Structural Motif
Related Work
Compact Motif-finding Problem Formulation
NP-Hard of the Compact Motif-finding Problem
A Polynomial Time Approximate Scheme
(R,C)-compact motif finding is still NP-Hard.
Reduction from the Sequence Consensus Problem
Input: n binary strings S1, …, Sn, each is of length m
Output: A substring ti of length l from each string Si, 1i n,
Objective: minimize 1 i <i’ n dH(ti, ti’), where dH is
Hamming distance.
Basic Idea:
Try to find a way of reduction to make:
dRMSD=Hamming Distance
(R,C)-compact motif finding is still NP-Hard.
Each l-mer is
transformed
into 6l 3D
points.
110 110
001 000000
111111
0(0, 2i, 0),
1(1, 2i, 0)
(R,C)-compact motif finding is still NP-Hard.
Each l-mer is transformed into 6l 3D points.
110 110 001 000000 111111
0(0, 2i, 0), 1(1, 2i, 0)
The centroid will be (1/2, 2i, 0) (Easy translation)
Large “tail” no rotation
RMSD = Hamming Distance
Small distortion to each point to make it proteinlike.
Sequence Consensus Problem (1,0)Compact Motif Finding Problem
Outline
Introduction to Structural Motif
Related Work
Compact Motif-finding Problem Formulation
NP-Hard of the Compact Motif-finding Problem
A Polynomial Time Approximate Scheme
The Basic Idea of Our PTAS
There are always a few “important” substructures, whose consensus holds most of
the “secrets” of the true optimal motif.
Therefore, if we can simply do exhaustive
search to find these few sub-structures, then
the trivial optimal solution for these substructures is a good approximation to the real
optimal solution.
Technique 1: Sampling
We sample only r proteins,
consider each motif in a sampled protein,
we can say we almost know the optimal
solution.
Sampling will introduce only a bit of error.
There is at least one selection schema,
whose consensus has a cost value less than
(1+1/r)OPT.
So, we can find this schema by simply
enumerating operation.
Technique 2: Discretize the Rotation
Space
Each rotation is parameterized by three
angles
1, 2, 3[0, 2)
Discretize the angles with step size ’
we get an ’-rotation net.
Discretized rotation will not introduce a
large error, either.
A parameterized algorithm for protein
structure alignment. J. Xu, F. Jiao, and B.
Berger. RECOMB2006.
PTAS
PTAS
Performance Ratio Analysis
Running Time
Each protein contains M motifs
M is a polynomial of protein length
M O(C ml ) O(m l )
Each motif can adopt W rotations
W depends on the constant
So the number of consensus is less than
O(nr(MW)r)= O((nMW)r)
Conclusion and Future Work
We prove the (R,C)-compact motif finding
problem is NP-hard
We obtain a PTAS for this problem.
Future Work:
Further reduce the time complexity
Design some practical algorithms.
Solve a more general case.
Thank You. Questions…