Transcript Document

Identifying Structural
Motifs in Proteins
Rohit Singh
Joint work with Mitul Saha
The Big Picture: small motifs
Active Sites are preserved across proteins with
similar functions
The Big Picture: large motifs
Even bigger motifs are often conserved.
Oh, BTW…
There are two different issues here:
1. Find the best match for the motif in the
protein

Extensively studied in vision/graphics
2. Is the match “significant” ?


For small motifs a good match is more likely
What is probability of a match against a
random protein being this good ? (cf. BLAST)
What’s in it for a CS guy ?
 The problem of matching two point-sets
has many applications
 Most current algorithms geared towards
points that are indistinguishable (e.g.
points on a mesh)
 There are few rigorous results on the
significance of matches
So what have we done ?
 Towards a more rigorous approach for
scoring the quality of a match (between
motif and protein)
 Provide a method that is capable of
finding the optimum match based on
these criteria
Problem Description
 Given a motif and a protein, for each point in
the motif, find a corresponding point in the
protein.
 Given these correspondences, find the best
transformation (rotation and translation only)
of the motif that aligns it to the protein.
 Optimize over all possible correspondences
Oh, BTW…
 Given two sets of k points, easy to find
the optimal rotation and translation that
minimizes the least sum-of-squared
error (also RMSD).
 Boils down to finding the largest
eigenvalue of a 4x4 matrix.
Previous Work
 Brute Force approach: match edges of
same length.
 Geometric Hashing:
Pennec & Ayache, Bioinformatics, 1998
What is missing ?
 Ad hoc: Try to minimize a quantity that is only
indirectly related to the least square error or
RMSD.
 Hard to evaluate the quality of partial
matches
 Brute Force methods infeasible for larger
motifs
 Geometric Hashing requires significant
preprocessing
Estimating the error
Model the alignment problem as a
regression problem:
Y = model set (protein)
T = data set (motif)
g = transformation (rot+trans)
Which error criterion to use ?
• Least Mean Squared Error (also RMSD)
LSE is not good when you have outliers.
what to do ?
Robust error estimation
 LSE: larger error terms have
disproportionate influence.
 Use a function to reduce the effect of
larger error terms (M-estimators)
Its an optimization problem!
Consider the case of full matching:
Domain: set of all possible correspondences
between points on the motif and points on the
protein
Range: given a particular set of corresponding
points, the minimum error in aligning those
point sets.
Goal: find the global minimum of this function!
Looking for global minimum
Our approach:
 Prune the search space to a small and
plausible sub-space
 Find (most) of the local minima in this
sub-space quickly
 Choose the minimum over these local
minima
Finding local minima is easy:ICP
Iterative Closest Point (Besl-McKay):
ICP contd…
 ICP is guaranteed to converge to a local
minimum
 But depends a lot on initial seeding
 Convergence is quick: ~4-5 iterations
ICP movie
Pruning the search space
 Every point in motif/protein has some
features:

Amino acid type, element type, sec. structure,
hydrophobic/polar, ‘substitutable’
 Assume: a point with feature X can only
match another point with feature X (or
{Y,Z,W})
 Assume: some features are more frequent
than others
Our Approach
 Find the feature that is least frequent in
protein.
 For each occurrence of the feature:
 Seed
ICP appropriately. Find local
minimum.
 Look around a few more times
 Return the best answer you have
Observations
 Will always find a perfect match, if it
exists.

Moreover, will find such a match quickly.
 The error is directly interpretable in
RMSD terms
Does it work ?
…contd
Trypsin active site against Trypsin like proteins
…contd
Trypsin active site against kinases
What about partial matching ?
 Basic idea is the same: pruning+ICP
 Replace least squared error estimates by M-
estimator based errors.
Problem: How to find the optimal
rotation/translation that minimizes this new
variety of error criterion?
Answer: weighted LSE ?
Is there a better way ?
RANSAC
Choice of the parameters has statistical justification
Plain Vanilla (Least Squares):
M-estimator+ weighted LSE
M-estimator + RANSAC
…contd
Data for distorted trypsin active site against ten
different trypsins:
Future Work
 Test on larger motifs: secondary
structure elements
 Choice of better features
 A theoretical guarantee about the
quality of results
 Explore different criteria for partial
matching
Thanks!