Transcript PPT
High-Performance Algorithm
Engineering for Computational
Phylogenetics [B Moret, D Bader]
Kexue Liu
CMSC 838 Presentation
Motivation
Phylogeny reconstruction from molecular data
Poses complex optimization problem
NP hard and thus computationally intractable
High performance Algorithm Engineering
Reduce the running time of existing phylogenetic algoritms
CMSC 838T – Presentation
Talk Overview
Overview of talk
Background
Breakpoint Phylogeny
Breakpoint Analysis
Re-Engineering Techniques
Impact in computational Biology
Observations
CMSC 838T – Presentation
Background
Algorithm Engineering
Transform a pencil-and-paper algorithm into an efficient, robust
implementation.
Main focus is experimentation
High Performance Algorithm Engineering
Running time and quality of the solution as the paramount goal
Includes parallelism
Refining serial part of the code
Cache-aware programming is a key to performance
CMSC 838T – Presentation
Background
Phylogeny
Reconstruction of the evolutionary history of a collection of
organisms
Takes the form of an evolutionary tree
Computational Phylogenetics
Is extremely computation-intensive
Methods for sequence data (RNA, DNA, amino acid, Protein) do not
scale up to whole genome
Genome level data
a)
At this level, evolution is slow
b)
Enable us to recover deep evolutionary relationships
c)
Much hard to analyze than sequence data
Optimization criteria
a)
Heuristics
b)
Parsimony criterion
c)
Maximum likelihood
CMSC 838T – Presentation
Breakpoint Phylogeny
Deal with simple genomic data
Organisms have a single chromosome or contain singlechromosome organelles
Each chromosome can be represented by an ordering of oriented
genes.
Evolutionary process includes inversion, transposition, insertion,
deletion and duplication.
Approaches
Construct parsimonious tree
Known or conjectured to be NP hard
b)
No automated tool to solve it
Neighbor-joining heuristics
Fast and valuable
b)
Can’t recover the ancestral gene orders.
Breakpoint phylogeny by Blanchette and Sankoff.
a)
a)
CMSC 838T – Presentation
Breakpoint phylogeny
More special case:
All the genomes have the same set of genes
Each gene appears once.
Is of interest to biologists
Inversions are the main evolutionary mechanism on such
genomes
Works well for certain datasets.
Implementation developed by Sankoff and Blanchette
Breakpoint Analysis
Too slow to be used on anything other than small datasets with
a few genes.
CMSC 838T – Presentation
Breakpoint Analysis: Details
Breakpoint:
Two genomes G and G’ with the same set of genes and each
gene appears exactly once in each genome
Ordered pair of genes, (gi , gj) appears in G
Neither (gi , gj) nor (-gj , -gi) appears in G’
Breakpoint Distance
Median for three genomes
Number of breakpoints between two genomes.
The genome which minimizes the breakpoint distance
Median Problem for Breakpoints
Construct a median of given genomes
NP hard
CMSC 838T – Presentation
Breakpoint Analysis
Method developed by Sankoff and Blanchette to solve
breakpoint phylogeny
Uses reduction from MPB to Travelling Salesman
Problem
Directed MPB to undirected TSP
Representing each gene by a pair of cities connected by an
edge
Outer loop enumerates all (2n-5)!! trees on n leaves
Inner loop runs unknown number of iterations
Computation complexity is exponential in each of the
number of genomes and the number of genes.
CMSC 838T – Presentation
Breakpoint Analysis
Initially label all internal nodes with gene orders
Repeat
For each internal node v, with neighbors A, B, C do
Solve the MPB on A, B, C to yield label m
If relabelling v with m improves the score of T,
then do it
until no internal node can be relabelled
CMSC 838T – Presentation
Re-Engineering Techniques
Profiling:
Identify bottlenecks to balance implementation
Eliminate problems which include excessive resource
consumption or poor results.
Examples:
Hand-unrolling loops, cut the running time down by a
factor at least six.
b.
Refine distance computations
c.
Refine lower bound computations
Speed-up by one order of magnitude on Campanulaceae
dataset
a.
CMSC 838T – Presentation
Re-Engineering Techniques
Cache Awareness
Memory footprint
BPAnalysis: 60MB
b.
GRAPPA: 1.8MB
Memory locality
BPAnalysis: poor locality, working set size of about 12MB
b.
GRAPPA : good locality, working set size of about 600KB
Minimizing pointer dereferencing
Reuses allocated storage
Studies indicate that gain is likely to be factors of anywhere
from 2 to 40
a.
a.
CMSC 838T – Presentation
Re-Engineering Techniques
Low-level Algorithmic Changes
Using all of the available information
Examples:
Using lower bound to eliminate over 95% of the tree.
b.
Take advantage of special structures: TSP has only two
nontrivial edges( cost 1 and cost 2)
Speed-up by a factor of 5-10.
a.
CMSC 838T – Presentation
Re-Engineering Techniques: Parallel Aspects
Efficient Tree Generation,
Avoid unbounded-precision arithmetic
Allow generation from any count with variable gap
Provides parallel generation and also sampling of search space
Portable MPI implementation, each processor handles a fraction of
trees.
On the 512-processor Alliance cluster LOS LOBOS at UNM,
obtained a 512-fold speedup.
Summarize speedups:
Profiling: one order of magnitude
Cache awareness: factors of anywhere from 2 to 40
Low-level Algorithmic changes: 5-10
512-processor parallelism: 512
Overall, Grappa demonstrated a million-fold speedup over the original
implementation
CMSC 838T – Presentation
Evaluation: the Bluebell Family
Dataset: full gene sequences for the chloroplasts of
12 species of Campanulaceae (Bluebells), plus
tobacco.
Chloroplast
a.
A semi-independent organism that lives within plant cells
and allow them to photosynthesize.
b.
Have a single chromosome with about 120 genes.
Optimization target: reconstruct the phylogeny with
the least total amount of genomic changes.
Environment: 512-processor Los Lobos supercluster
at UNM
Results:
Speedup by three to four orders in the serial part
Total speedup by over one million
CMSC 838T – Presentation
Phylogeny of Bluebell Family
CMSC 838T – Presentation
Impact in Computational Biology
Much faster implementations
Alter the practice of research in biology and medicine
Reducing the time of an analysis from two years down to a day
Makes an enormous difference in the pace and cost of drug
discover and development
Fast and accurate analysis software
Enables researchers to pursue more leads, develop better
institution on small dataset
Form new conjectures about biological mechanism
CMSC 838T – Presentation
Observations
Algorithm re-engineering
Uncovers salient characteristic of the algorithm
Enable us to develop better algorithms
Example: find a true linear time algorithm for computing
inversion distance in the development of GRAPPA.
Can be applied to any existing bioinformatics algorithms
Several have been engineered for performance, such as
BLAST
Limited benefits in theoretical terms when applied to NP-hard
optimization problems
Does not scale up to “industrial-strengthen”
Grappa only enables to move from 10 taxa to 13 taxa
CMSC 838T – Presentation
Thank you
CMSC 838T – Presentation