20 years and 22 papers with Bernard Moret

Download Report

Transcript 20 years and 22 papers with Bernard Moret

20 years and 22 papers with
Bernard Moret
Tandy Warnow
The University of Illinois at Urbana-Champaign
Brief history
• We met in 1992, when I was working in the
Discrete Algorithms Group at Sandia National
Labs.
• I moved to the University of Pennsylvania in
1993, then to the University of Texas at Austin
in 1999. No papers yet… we’re just friends.
In 1999, Bob Jansen needed help with
13 Campanulaceae genomes
In 1999, Bob Jansen needed help with
13 Campanulaceae genomes
In 1999, Bob Jansen needed help with
13 Campanulaceae genomes
1999-2006, 22 papers with Bernard
• Genome rearrangement phylogeny estimation
• Phylogenetic network estimation
• Comparing phylogenetic methods on large
datasets
• Absolute fast converging methods
1999-2006, 22 papers with Bernard
• Genome rearrangement phylogeny estimation
• Phylogenetic network estimation
• Comparing phylogenetic methods on large
datasets
• Absolute fast converging methods
1999-2006, 22 papers with Bernard
• Genome rearrangement phylogeny estimation
• Phylogenetic network estimation
• Comparing phylogenetic methods on large
datasets
• Absolute fast converging methods
1999-2006, 22 papers with Bernard
• Genome rearrangement phylogeny estimation
• Phylogenetic network estimation
• Comparing phylogenetic methods on large
datasets
• Absolute fast converging methods
Highlights (this talk)
• Genome rearrangement phylogeny estimation
• Phylogenetic network estimation
• Comparing phylogenetic methods on large
datasets
• Absolute fast converging methods
Highlight #1
• Genome rearrangement phylogeny estimation
Too many papers to list
In 1999, Bob Jansen needed help with
13 Campanulaceae genomes
Genome Rearrangement Phylogeny
A
A
C
D
X
B
E
Y
E
Z
C
F
B
D
W
F
Breakpoint Phylogeny
Proposed by Sankoff and Blanchette in J Comp. Biol. 1998
• Input: Chromosomes given as signed gene orders, one copy of
each gene in each chromosome
• Output: Tree with the minimum number of breakpoints
BPAnalysis: Heuristic for the Breakpoint Phylogeny
Sankoff and Blanchette, 1998
BPAnalysis: Heuristic for the Breakpoint Phylogeny
Sankoff and Blanchette, 1998
Finding the breakpoint median of three genomes is
NP-hard (Pe’er and Shamir 1998), but can be
solved using TSP (Travelling Salesman Problem)
solvers (Blanchette and Sankoff 1997).
BPAnalysis: Heuristic for the Breakpoint Phylogeny
Sankoff and Blanchette, 1998
Bernard and I estimated that BPAnalysis would take
~200 CPU years to complete on Bob’s dataset.
MPBE
• Maximum Parsimony on Binary Encoding
– Character for every possible adjacency (is oriented
gene x followed by oriented gene y?)
• Mary Cosner PhD dissertation 1993
• Fast because maximum parsimony heuristics
are relatively efficient
• Can find infeasible solutions
MPBE on Bob’s dataset suggested
MPBE
• Maximum Parsimony on Binary Encoding
– Character for every possible adjacency (is oriented
gene x followed by oriented gene y?)
• Mary Cosner PhD dissertation 1993
• Fast because maximum parsimony heuristics
are relatively efficient
• Can find infeasible solutions
MPBE on Bob’s dataset suggested
MPBE
• Maximum Parsimony on Binary Encoding
– Character for every possible adjacency (is oriented
gene x followed by oriented gene y?)
• Mary Cosner PhD dissertation 1993
• Fast because maximum parsimony heuristics
are relatively efficient
• Can find infeasible solutions
MPBE on Bob’s dataset suggested transpositions –
very
surprising.
MPBE
on Bob’s dataset suggested
Neighbor Joining on Breakpoint Distances
40 taxa, 120 genes,
Inv.:Transp.:InvTrans
p=2:1:1
Error
in
inferred
tree
NJ(BP)
birth-death trees,
expected deviation
from ultrametricity=2
Amount of evolution
NJ(BP): seconds to complete
Phylogeny reconstruction in 1999
• Distance-based
– Breakpoint (BP) distances [Blanchette, Kunisawa, Sankoff
1999]
• Breakpoint tree (NP-hard, even for three taxa)
– BPAnalysis: [Sankoff & Blanchette 1998]: exhaustive search
through treespace to find the minimum breakpoint length
(the number of breakpoints on the tree)
• MPBE [Cosner 1993]: maximum parsimony on binary encoding
Phylogeny reconstruction in 1999
• Distance-based
– Breakpoint (BP) distances [Blanchette, Kunisawa, Sankoff
1999] – fast but high error
• Breakpoint tree (NP-hard, even for three taxa)
– BPAnalysis: [Sankoff & Blanchette 1998]: exhaustive search
through treespace to find the minimum breakpoint length
(the number of breakpoints on the tree) – too slow
• MPBE [Cosner 1993]: maximum parsimony on binary encoding:
can find infeasible ancestors
The challenges!
1. Find all the best genome trees for Bob’s
dataset, and determine if inversions suffice
(or if we really do need transpositions).
2. Design statistically rigorous methods for
genome rearrangement phylogeny.
3. Design efficient techniques to enable
genome-scale phylogeny for large datasets
that are difficult to analyze.
Genomes Evolve by Rearrangements
1
2
3
4
5
6
7
9
10
1
2
3 –8
9 –7
-8
4 –6
–7
5 –5
–6
6 -4
–5
7 –4
8
9
10
• Inversion (Reversal)
• Transposition
• Inverted Transposition
8
Generalized Nadeau-Taylor (GNT) Model
Proposed in Wang and Warnow, STOC 2001.
• Each type of event (inversion, transposition,
and inverted transposition) has a probability
of occurring, and is specified by
– GNT(a,b,c): a+b+c=1
• All events of the same type are equiprobable
• The tree has branch lengths indicating the
expected number of events on each branch.
Distance-based methods
• Breakpoint distances (Blanchette, Bourque, and
Sankoff 1997)
• Inversion distances (Bader, Moret, and Yan 2001)
• EDE (Empirically-Derived Estimator of true
evolutionary distance), Moret et al., ISMB 2001 –
derived from inversion-only model
• IEBP (Wang and Warnow STOC 2001): estimates
true evolutionary distance but needs to know or
estimate the GNT parameters.
Figure 3 from Moret and Warnow, 2004
40 taxa, 120 genes
Inv.:Transp.:InvTransp
=2:1:1
Birth-death trees,
expected deviation from
ultrametricity=2
Amount of evolution
BP=breakpoint distance
INV=inversion distance
EDE: statistically-based estimator [Wang et al. ‘01] - highly robust.
All these methods are polynomial time.
Moret et al. breakpoint phylogeny approach:
2,000,000-fold speedup over BPAnalysis as serial
codes (parallelism brings it higher)
From Moret, Tang, and Warnow, 2004
Bounding
• Upper bound on the best score: score the NJ(EDE)
tree using improved implementation of BPAnalysis.
• Lower bound on a given tree T using the circular
ordering on leaves: greedy technique to find the
planar embedding that achieves the highest
breakpoint score for its circular odering – half that is
a lower bound on T’s breakpoint score.
By the way, Bernard and I had a big argument about using
this lower bound… he didn’t think it would work, but it
did!
Bounding
• Upper bound on the best score: score the NJ(EDE)
tree using improved implementation of BPAnalysis.
• Lower bound on a given tree T using the circular
ordering on leaves: greedy technique to find the
planar embedding that achieves the highest
breakpoint score for its circular ordering – half that is
a lower bound on T’s breakpoint score.
By the way, Bernard and I had a big argument about using
this lower bound… he didn’t think it would work, but it
did!
GRAPPA
http://www.cs.unm.edu/~moret/GRAPPA/
• Genome Rearrangement Analysis under Parsimony
and other Phylogenetic Algorithms
• Heuristics for NP-hard optimization problems
• Uses high-level algorithmic ideas with low-level
algorithms engineering to dramatically speed-up the
searches for the breakpoint and inversion phylogenies.
• 2000-2004
• Project leader: Bernard Moret
In 1999, Bob needed help with 13
Campanulaceae genomes
The challenges!
1. Find all the best genome trees for Bob’s
dataset, and determine if inversions suffice
(or if we really do need transpositions).
2. Design statistically rigorous methods for
genome rearrangement phylogeny.
3. Design efficient techniques to enable
genome-scale phylogeny for large datasets
that are difficult to analyze.
What we found
• Optimal solutions for the breakpoint
phylogeny, for the inversion-only phylogeny,
and for a weighted sum of inversions and
transpositions.
• 67 inversions suffices for this dataset, and no
transpositions are needed!
This was just the beginning…
Other events, such as
• Duplications, Insertions, and Deletions
• Fissions and Fusions
Other models
• HP (Hannenhali and Pevzner)
• Double Cut-and-Join (Yancopoulos et al.)
Other techniques
• DCM-boosting to scale GRAPPA to large datasets (1000 species)
• Estimating true evolutionary distances under these complex models
• Inferring ancestral genomes under these complex models
Bernard has made progress on all these things!! (But without me, so
someone else can talk about this…)
This was just the beginning…
Other events, such as
• Duplications, Insertions, and Deletions
• Fissions and Fusions
Other models
• HP (Hannenhali and Pevzner)
• Double Cut-and-Join (Yancopoulos et al.)
Other techniques
• DCM-boosting to scale GRAPPA to large datasets (1000 species)
• Estimating true evolutionary distances under these complex models
• Inferring ancestral genomes under these complex models
Bernard has made progress on all these things!! (But without me, so
someone else can talk about this…)
Highlight #2
Absolute fast converging methods
• Warnow, Moret and St. John, SODA 2001
• Nakhleh, Moret, etc. PSB 2002
• Moret, Roshan, and Warnow, WABI 2002
• Moret, Wang, and Warnow 2002 (IEEE
Computer)
Markov Model of Site Evolution
Simplest (Jukes-Cantor, 1969):
• The model tree T is binary and has substitution probabilities p(e) on
each edge e.
• The state at the root is randomly drawn from {A,C,T,G} (nucleotides)
• If a site (position) changes on an edge, it changes with equal probability
to each of the remaining states.
• The evolutionary process is Markovian.
The different sites are assumed to evolve independently and identically
down the tree (with rates that are drawn from a gamma distribution).
More complex models (such as the General Markov model) are also
considered, often with little change to the theory.
Quantifying Error
FN
FN: false negative
(missing edge)
FP: false positive
(incorrect edge)
50% error rate
FP
Statistical consistency
error
Data
Distance-based estimation
Neighbor Joining (NJ) on large trees
Error Rate
0.8
Simulation study based
upon fixed edge
lengths, K2P model of
evolution, sequence
lengths fixed to 1000
nucleotides.
Error rates reflect
proportion of incorrect
edges in inferred trees.
NJ
0.6
0.4
0.2
[Nakhleh et al. ISMB 2001]
0
0
400
800
No. Taxa
1200
1600
In other words…
error
Data
Statistical consistency doesn’t guarantee accuracy
w.h.p. unless the sequences are long enough.
Sequence length requirements
The sequence length (number of sites) that a phylogeny
reconstruction method M needs to reconstruct the
true tree with probability at least 1- depends on
• M (the method)
•
• f = min p(e),
• g = max p(e), and
• n, the number of leaves
We fix everything but n.
Neighbor Joining’s sequence length
requirement is exponential!
• Atteson 1999: Let T be a Jukes-Cantor model
tree defining additive matrix D. Then Neighbor
Joining will reconstruct the true tree with high
probability from sequences that are of length
O(lg n emax Dij).
• Lacey and Chang 2009: Matching lower bound
Statistical consistency, exponential convergence,
and absolute fast convergence (afc)
“Boosting” phylogeny reconstruction
methods
• DCMs “boost” the performance of phylogeny
reconstruction methods.
Base method M
DCM
DCM-M
Distance-based estimation
Divide-and-conquer for phylogeny
estimation
Divide-and-conquer for phylogeny
estimation
Construct subset trees
Supertree Step
Refinement Step
DCM1 Decompositions
Input: Set S of sequences, distance matrix d, threshold value
q Î{dij}
1. Compute threshold graph
Gq = (V , E),V = S , E = {(i, j ) : d (i, j ) £ q}
2. Perform minimum weight triangulation (note: if d is an additive matrix, then
the threshold graph is provably triangulated).
DCM1 decomposition :
Compute maximal cliques
DCM1-boosting:
Warnow, St. John, and Moret,
SODA 2001
Exponentially
converging
(base) method
DCM1
SQS
Absolute fast
converging
(DCM1-boosted)
method
• The DCM1 phase produces a collection of trees (one for each threshold),
and the SQS phase picks the “best” tree.
• For a given threshold, the base method is used to construct trees on
small subsets (defined by the threshold) of the taxa. These small trees
are then combined into a tree on the full set of taxa.
Neighbor Joining on large diameter trees
Error Rate
0.8
Simulation study based
upon fixed edge
lengths, K2P model of
evolution, sequence
lengths fixed to 1000
nucleotides.
Error rates reflect
proportion of incorrect
edges in inferred trees.
NJ
0.6
0.4
0.2
[Nakhleh et al. ISMB 2001]
0
0
400
800
No. Taxa
1200
1600
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
0.8
NJ
Error Rate
DCM1-NJ
0.6
0.4
0.2
0
0
400
800
No. Taxa
1200
• Theorem (Warnow,
Moret, and St. John,
SODA 2001):
• DCM1-NJ converges
to the true tree
from polynomial
length sequences
1600
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
0.8
NJ
Error Rate
DCM1-NJ
0.6
0.4
0.2
0
0
400
800
No. Taxa
1200
• Theorem (Warnow,
Moret, and St. John,
SODA 2001):
• DCM1-boosting:
reducing sequence
length requirements
for gene tree
accuracy w.h.p.
from exponential to
polynomial
1600
Highlight #3
Building the Tree of Life: A National Resource for Phyloinformatics
and Computational Phylogenetics
NSF Large ITR grant ($11.6 Million), 2003-2010
Directors:
Bernard Moret (2003-2006) and Tandy Warnow (2006-2010)
Other key personnel: Mark Holder, Junhyong Kim, Wayne Maddison,
Mark Miller, Brent Mishler, Satish Rao, David Swofford, and Val
Tannen.
CIPRES: graduate students and postdocs
Francois Barbancon, Nicholas Bray,
Kevin Chen, Shirley Cohen,
Costis Daskalakis, Nick Eriksson,
Yu Fan, Kirsen Fisher, Ganesh Ganapathy,
Sheng Guo, Tracy Heath, Cameron Hill,
David Kysela, Ruth Kirkpatrick, Henry Lin,
Kevin Lin, Wenguo Lin, Andrew McGregor,
Frank Mannino, Rui Mao, Radu Mihaescu,
Eric Miller, Luay Nakhleh,
Manikandan Narayanan, Serita Nelesen,
Smriti Ramakrishinan, Samantha Riesenfeld,
Sebastien Roch, Usman Roshan,
Ariel Schwartz, Stephen Smith, Errol Strain,
Jeet Sukumaran, Shel Swenson, Kunal Talwar,
Andres Varon, Rutger Vos, Yifeng Zheng, and
Derrick Zwickl
Michael Alfaro
Mark Holder
Peter Midford
Sagi Snir
Shel Swenson
Rutger Vos
CIPRES: graduate students and postdocs
Francois Barbancon, Nicholas Bray,
Kevin Chen, Shirley Cohen,
Costis Daskalakis, Nick Eriksson,
Yu Fan, Kirsen Fisher, Ganesh Ganapathy,
Sheng Guo, Tracy Heath, Cameron Hill,
David Kysela, Ruth Kirkpatrick, Henry Lin,
Kevin Lin, Wenguo Lin, Andrew McGregor,
Frank Mannino, Rui Mao, Radu Mihaescu,
Eric Miller, Luay Nakhleh,
Manikandan Narayanan, Serita Nelesen,
Smriti Ramakrishinan, Samantha Riesenfeld,
Sebastien Roch, Usman Roshan,
Ariel Schwartz, Stephen Smith, Errol Strain,
Jeet Sukumaran, Shel Swenson, Kunal Talwar,
Andres Varon, Rutger Vos, Yifeng Zheng, and
Derrick Zwickl
Michael Alfaro
Mark Holder
Peter Midford
Sagi Snir
Shel Swenson
Rutger Vos
And many others who were funded
by other grants
225 papers and dissertations
were supported by CIPRES, many
by CIPRES students and postdocs
The CIPRES Gateway
SDSC Supercomputers,
CIPRES Gateway Help Define
New “Tree of Life”
SDSC Press Release, April 25, 2016
(Figure courtesy of Laura Hug,
Jill Banfield, and Nature Microbiology)
Bernard isn’t really retiring
• He’s just going to be working from a warmer
place…
• We’ll keep him busy with collaborations via
internet and phone…
• And if we have to, we’ll go visit him in his new
home.
Bernard isn’t really retiring
• He’s just going to be working from a warmer
place…
• We’ll keep him busy with collaborations via
internet and phone…
• And if we have to, we’ll go visit him in his new
home.
Bernard isn’t really retiring
• He’s just going to be working from a warmer
place…
• We’ll keep him busy with collaborations via
internet and phone…
• And if we have to, we’ll go visit him in his new
home.
Bernard isn’t really retiring
• He’s just going to be working from a warmer
place…
• We’ll keep him busy with collaborations via
internet and phone…
• And if we have to, we’ll go visit him in his new
home.
Bon Voyage, Bernard!