Transcript PPT
Solving the Protein Threading
Problem in Parallel
Nocola Yanev, Rumen Andonov
Indrajit Bhattacharya
CMSC 838T Presentation
Motivation
Problem paper is trying to solve
3D structure prediction using threading
Is a given target sequence likely to fold to a 3D template core?
Find the alignment that minimizes some score function
NP-complete; optimal solution not possible
MAX-SNP-hard; arbitrary approximation not possible
Why do we care
3D structure determines biological function of protein
Amino acid sequence (almost) uniquely determines 3D
structure
Threading is usually less accurate than comparative modeling
but easier to solve
CMSC 838T – Presentation
Talk Overview
Overview of talk
Motivation
Techniques
Evaluation
Related work
Observations
CMSC 838T – Presentation
Techniques
Approach
Reduce the problem to some known theoretical problem of
interest
In this case, network flow
Use existing tools for solving the theoretical problem efficiently
CPLEX
Explore possibilities for parallelizing the problem
Investigate the intrinsic hardness for real biological examples
CMSC 838T – Presentation
Mathematical Formulation
Two structurally
similar proteins
Spatial adjacencies
(interactions)
Possible threading
with a sequence
Objective function
CMSC 838T – Presentation
Reduction to Network Flow: An Example
CMSC 838T – Presentation
Reduction to Network Flow:
Variables and Constraints
Standard Network Flow
Variable xi,t for each segment to position assignment
Restricted to [0, 1]
With standard flow conservation constraints
Additional cost for non-local interactions
Variable zi,t,i’,t’ for each non-local interaction
Restricted to {0, 1}
Constrained to sum to 1 for each non-local pair (i, i’)
Upper bounded by flow entering (i, t) and leaving (i’, t’)
CMSC 838T – Presentation
Drawbacks of Approach
Integer programming is hard to solve!
Relax to linear programming with (0, 1) variables
Approximate to integer solution using standard heuristics
Existing tools like CPLEX
Huge number of variables
For 36 segments and 81 positions, IP problem has 741264
rows, 360945 columns and 54145231 non-zero variables!
Need to reduce number of variables and constraints
Calls for parallelization if possible
CMSC 838T – Presentation
Parallel Solution
Utilize special flow constraints
Split into sub-problems that may be solved parallely
Split the k-th layer in the graph into r intervals
Force path for a sub-problem to pass through a particular
interval in the layer
Pass best bound for objective function found so far as
parameter to sub-problem
Sub-task aborts when dual objective function exceeds the
current best bound
CMSC 838T – Presentation
Improving Parallel Solution
Drawback: Hardest Sub-Problem Dominates!
Parallel strategy was found to be slower than the sequential!
Sub-problems can potentially become harder to solve
Many more difficult sub-problems than easy ones
Solution:
Break the atomicity of the tasks
Each sub-task periodically checks the current best bound and
updates its cut-off
Extra overhead is still small compared to task granularity
Now the easiest executing sub-task dominates!
CMSC 838T – Presentation
Evaluation
Experimental environment
Real protein sequences
ILOG CPLEX Callable Library
SUN Ultra-Sparc II, 450 Mhz
Objective function coefficients generated from FROST
Maximum of 7 processors and 29 sub-problems
Evaluation results
Sequential version much faster than previous branch-andbound results for the same problem formulation
Time taken comparable to PROSPECT
Splitting and parallelization significantly improve turnaround
Really tiny gap between relaxed LP and ILP solutions
Mostly integer solutions even for relaxed LP!
CMSC 838T – Presentation
Result Tables
Comparison with branch and bound algorithm
Comment: Self threading results in significantly
lower scores (as should be)
CMSC 838T – Presentation
Result Tables
Gap between relaxed LP and ILP
Comment: Tiny relaxation gap. (significance?)
CMSC 838T – Presentation
Result Tables
Size of the LP formulation
Comment: LP problem size is still too large.
CMSC 838T – Presentation
Result Tables
Performance with parallel sub-tasks
Comment: Longer times with more sub-problems??
CMSC 838T – Presentation
Related Work
Similar / previous approaches
Lathrop and Smith, 1998
Uses same cost function
Branch and bound algorithm for searching the space of
threadings
Xu, Xu and Uberbacher, 1998
Divide and conquer algorithm
Xu, Li, Lin, Kim and Xu, 2003
Linear programming formulation
Solved using b&b algorithm
None of the above suggest any parallelizing scheme
CMSC 838T – Presentation
Observations
Points of Interest
Mapping to a known problem of interest
Nicely utilizes particular constraints to break into independent
subtasks
Threading of real amino acid sequences seems possible
Raises interesting questions about real-life protein threading
being in P
Solver tailored for this particular problem may yield better
results
CMSC 838T – Presentation
Observations
Criticism
Not enough experiments with large number of subtasks and
processors to show scaling
Prohibitively large number of variables and constraints
How accurate are the objective function coefficients?
What is the resolution of the objective function?
Threading onto multiple sequences for prediction still looks
daunting
Not clear how to extend the idea for 3-way and more complex
interactions
Improvements
Seems possible to break up the sub-tasks recursively
CMSC 838T – Presentation
Thank you!
CMSC 838T – Presentation