Transcript PPT

Solving the Protein Threading
Problem in Parallel
Nocola Yanev, Rumen Andonov
Indrajit Bhattacharya
CMSC 838T Presentation
Motivation


Problem paper is trying to solve

3D structure prediction using threading

Is a given target sequence likely to fold to a 3D template core?

Find the alignment that minimizes some score function

NP-complete; optimal solution not possible

MAX-SNP-hard; arbitrary approximation not possible
Why do we care

3D structure determines biological function of protein

Amino acid sequence (almost) uniquely determines 3D
structure

Threading is usually less accurate than comparative modeling
but easier to solve
CMSC 838T – Presentation
Talk Overview

Overview of talk

Motivation

Techniques

Evaluation

Related work

Observations
CMSC 838T – Presentation
Techniques

Approach

Reduce the problem to some known theoretical problem of
interest

In this case, network flow
Use existing tools for solving the theoretical problem efficiently

CPLEX
Explore possibilities for parallelizing the problem

Investigate the intrinsic hardness for real biological examples


CMSC 838T – Presentation
Mathematical Formulation
Two structurally
similar proteins
Spatial adjacencies
(interactions)
Possible threading
with a sequence
Objective function
CMSC 838T – Presentation
Reduction to Network Flow: An Example
CMSC 838T – Presentation
Reduction to Network Flow:
Variables and Constraints


Standard Network Flow

Variable xi,t for each segment to position assignment

Restricted to [0, 1]

With standard flow conservation constraints
Additional cost for non-local interactions

Variable zi,t,i’,t’ for each non-local interaction

Restricted to {0, 1}

Constrained to sum to 1 for each non-local pair (i, i’)

Upper bounded by flow entering (i, t) and leaving (i’, t’)
CMSC 838T – Presentation
Drawbacks of Approach


Integer programming is hard to solve!

Relax to linear programming with (0, 1) variables

Approximate to integer solution using standard heuristics

Existing tools like CPLEX
Huge number of variables

For 36 segments and 81 positions, IP problem has 741264
rows, 360945 columns and 54145231 non-zero variables!

Need to reduce number of variables and constraints

Calls for parallelization if possible
CMSC 838T – Presentation
Parallel Solution

Utilize special flow constraints

Split into sub-problems that may be solved parallely

Split the k-th layer in the graph into r intervals

Force path for a sub-problem to pass through a particular
interval in the layer

Pass best bound for objective function found so far as
parameter to sub-problem

Sub-task aborts when dual objective function exceeds the
current best bound
CMSC 838T – Presentation
Improving Parallel Solution


Drawback: Hardest Sub-Problem Dominates!

Parallel strategy was found to be slower than the sequential!

Sub-problems can potentially become harder to solve

Many more difficult sub-problems than easy ones
Solution:

Break the atomicity of the tasks

Each sub-task periodically checks the current best bound and
updates its cut-off

Extra overhead is still small compared to task granularity

Now the easiest executing sub-task dominates!
CMSC 838T – Presentation
Evaluation


Experimental environment

Real protein sequences

ILOG CPLEX Callable Library

SUN Ultra-Sparc II, 450 Mhz

Objective function coefficients generated from FROST

Maximum of 7 processors and 29 sub-problems
Evaluation results

Sequential version much faster than previous branch-andbound results for the same problem formulation

Time taken comparable to PROSPECT

Splitting and parallelization significantly improve turnaround

Really tiny gap between relaxed LP and ILP solutions

Mostly integer solutions even for relaxed LP!
CMSC 838T – Presentation
Result Tables
Comparison with branch and bound algorithm
Comment: Self threading results in significantly
lower scores (as should be)
CMSC 838T – Presentation
Result Tables
Gap between relaxed LP and ILP
Comment: Tiny relaxation gap. (significance?)
CMSC 838T – Presentation
Result Tables
Size of the LP formulation
Comment: LP problem size is still too large.
CMSC 838T – Presentation
Result Tables
Performance with parallel sub-tasks
Comment: Longer times with more sub-problems??
CMSC 838T – Presentation
Related Work


Similar / previous approaches

Lathrop and Smith, 1998
 Uses same cost function
 Branch and bound algorithm for searching the space of
threadings

Xu, Xu and Uberbacher, 1998
 Divide and conquer algorithm

Xu, Li, Lin, Kim and Xu, 2003
 Linear programming formulation
 Solved using b&b algorithm
None of the above suggest any parallelizing scheme
CMSC 838T – Presentation
Observations

Points of Interest

Mapping to a known problem of interest

Nicely utilizes particular constraints to break into independent
subtasks

Threading of real amino acid sequences seems possible

Raises interesting questions about real-life protein threading
being in P

Solver tailored for this particular problem may yield better
results
CMSC 838T – Presentation
Observations


Criticism

Not enough experiments with large number of subtasks and
processors to show scaling

Prohibitively large number of variables and constraints

How accurate are the objective function coefficients?

What is the resolution of the objective function?

Threading onto multiple sequences for prediction still looks
daunting

Not clear how to extend the idea for 3-way and more complex
interactions
Improvements

Seems possible to break up the sub-tasks recursively
CMSC 838T – Presentation
Thank you!
CMSC 838T – Presentation