Transcript cis667-10
Genome Rearrangements
CIS 667 April 13, 2004
Genome Rearrangements
• We have seen how differences in genes at
the sequence level can be used to infer
evolutionary relations among species
Differences in sequences in (one or more)
genes resulted from point mutations (insert,
delete, substitute)
These are not the only type of changes that
can occur in the genome
Genome Rearrangements
• Repair of broken chromosomes is an important
process
Mistakes can occur, however
• Mistakes can also occur during crossover
• These mistakes cause changes in gene order
A large piece of chromosome can be moved or copied
to another location
It can also move from one chromosome to another
We call these movements genome rearrangments
Crossover
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Chromosome Repair
Genome Rearrangements
• These have important (usually fatal)
consequences for the organism and its evolution
• Alignments do not capture genome
rearrangments
Two species may have nearly the same gene
sequences, but in a different order (why would the two
species then be different?)
Genome Rearrangements
• We need some other way to compare entire
genomes (i.e. compare at a higher level)
• Rather than simple point mutations a
genome is obtained from another by a
number of a special kind of
rearrangements: Reversals
Use the number of reversals needed to
transform one genome into another to measure
evolutionary distance
The Method
• Use combinatorial optimization techniques
in an attempt to infer a most
economical sequence of rearrangement
operations to account for differences
among the genomes
Compare with character-based methods for
phylogenetics (parsimony)
Reversals
• Consider the genome of a species as a
sequence of blocks
A block is some sequence of the genome
(possibly containing more than one gene)
transcribed as a unit
Blocks are oriented since they can be
transcribed from either strand of DNA
Give homologous blocks the same label
Reversals
• Relation between chloroplast genomes of
alfalfa and garden pea:
Reversals
• Reversal operation for oriented blocks:
Inverts the order of affected blocks and
changes their orientation (arrow)
Affects a contiguous segment of blocks
• What sequence of reversal operations
could have changed alfalfa into garden
pea?
Would like to have a polynomial time
algorithm to find the shortest sequence
Genome Comparison vs. Gene
Comparison
• In the late 1980s, J. Palmer and his
colleagues studied the mitochondrial
genomes of cabbage and turnips
The gene sequences are very similar (some
genes are 99% equal)
Gene order, however, differs dramatically
Genome rearrangements are now considered
to be a common mode of molecular evolution
Genome Comparison vs. Gene
Comparison
• Extreme conservation of genes on X
chromosomes across mammalian species
provides an opportunity to study the evolutionary
history of X chromosome independently of the
rest of the genomes
• According to Ohno´s law, the gene content of X
chromosome has barely changed throughout
mammalian development in the last 125 million
years.
• However, the order of genes on X chromosomes
has been disrupted several times.
Human and Mouse X
Chromosomes
Human and Mouse X
Chromosomes
-4
-6
1
3
2
7
7
2
4
5
8
-3
6
4
5
8
6
4
5
8
4
5
8
2
1
2
7
1
2
7
3
1
2
-5
-4
-3
4
5
-3
3
8
6
7
2
5
-1
1
1
-3
-6
6
6
7
7
8
8
Genome Comparison vs. Gene
Comparison
• The traditional molecular evolutionary technique
is a gene comparison to construct a
phylogenetic tree
• In the ”cabbage and turnip” case this is hardly
suitable, since rate of point mutations in their
mitochondrial genes is so low that their genes
are almost identical
• Genome comparison (i.e. comparison of gene
orders) is the method of choice in the case of
very slowly evolving genomes
• Another area is the case where genomes evolve
very rapidly (genes not very similar)
Genome Comparison
• Only about (17839) genome rearrangements
have happened since human and mouse
diverged 80 million years ago
Mouse and human genomes can be viewed as a
collection of about 200 fragments which are shuffled
in mice as compared to humans
A comparative mouse-human genetic map gives the
position of a human gene given the location of a
related mouse gene
Man-Mouse Comparative
Physical Map
Definitions
• A signed permutation a over the set of labels L =
{1, 2, …, n} is a permutation such that a(i) = +a
or –a, where a L
• Example: +3, –2, –1 is a signed permutation
over L = {1, 2, 3}
Note that no label may appear twice in the
permutation
• A reversal [i,j] is an operation that transforms
one signed permutation into another, reversing
the order or a contiguous portion and flipping the
signs
Definitions
• a’ = a[i,j] = a(1), …, a(i – 1), –a(j), …, –
a(i), a(j + 1), …, a(n)
• We are interested in the problem of sorting
by reversals: Given two signed
permutations a and b, find the minimum
number of reversals r1, …, rt that will
transform a into b- ar1…rt = b
• The reversal distance db(a) = t
Definitions
• Note that the reversal operation does not
directly correspond to the biological
operations (inversion, translocation,
fission, fusion)
• Given a and b, can we always transform a
into b using only the reversal operation? If
so, how many reversals are required in the
worst case?
Breakpoints
• A breakpoint is a point between
consecutive labels in the initial
permutation that must necessarily be
separated by at least one reversal to reach
the target permutation
The two consecutive labels are not
consecutive in the target, or their orientations
are not the same in a relative sense
Breakpoints
• To formalize the idea of breakpoint, we introduce
the extended version of a
• Let a = a(1), …, a(n)
• Then the extended version of a is (L, a(1), …,
a(n), R)
• For example let extended a be (L, –2, –3, +1,
+6, –5, –4, R) and let extended b be (L, +1, +2,
+3, +4, +5, +6, R)
• The breakpoints are: (L,–2), (–2,–3), (–3,+1),
(+1,+6), (6,–5), (–4,R)
Breakpoints
• The number of breakpoints of a
permutation ais denoted by b(a)
In the example, = 6
• Can you characterize the situations where
L is involved in a breakpoint? When R is
involved in a breakpoint?
A Lower Bound
• A reversal can remove at most two
breakpoints
Cuts the permutation in exactly two places
So, if ar1 … rt =bthen
b(a) – b(ar1) 2
b(ar1) – b(ar1r2) 2
…
b(ar1…rt-1) – b(ar1…rt) 2
So b(a) 2t. If t = d(a), b(a)/2 d(a)
Reality and Desire Diagram
• The lower bound found is not very tight
• We can derive a better l.b. based on a structure
called the reality-desire diagram of a
permutation with respect to another
• To draw the diagram, we will represent +a with
the tuple (-a +a) and -a with the tuple (+a -a)
The orientation is given by the rightmost member of
the tuple
Reality and Desire Diagram
• A permutation is a sequence of adjacent
tuples:
a=+3, –2, –1, +4, –5 can be represented as:
L---(–3 +3)---(+2 –2)---(+1 –1)---(–4 +4)---(+5
–5)---R
b=L---(–1 +1)---(–2 +2)---(–3 +3)---(–4 +4)---(–
5 +5)---R
Reality and Desire Diagram
• Now we will draw a graph to represent a
(L, +3, -2, -1, +4, -5, R)
The reality diagram:
L
-3 +3
+2 -2
+1 -1
-4 +4
+5 -5
R
Reality and Desire Diagram
• Suppose that b is the identity (L, +1, +2,
+3, +4, +5)
We will add desire edges to the previous
graph to represent b
L
-3 +3
+2 -2
+1 -1
-4 +4
+5 -5
R
Reality and Desire Diagram
• a is the reality
• b is what is desired
• The diagram (a multigraph) shows both
reality and desire
Call it RD(a)
• We can rearrange the nodes of the graph
to make it easier to understand
Reality and Desire Diagram
L
R
Desire
Reality
Properties of RD(a)
• Each vertex has degree 2
Each node is incident to one edge from A, the
set of reality edges, and B, the set of desire
edges
• The connected components of the graph
are alternating cycles (edges alternate
between reality - blue - and desire - red)
• Each cycle has an even number of edges,
half reality and half desire
Properties of RD(a)
• The number of cycles of RD(a)is denoted
by cb(a)
• Note that cb(b) = n + 1 since b has no
breakpoints
All cycles are two parallel edges between the
same pair of nodes
We have 2n + 2 nodes, so n + 1 cycles
This is the only permutation for which cb(a) =
1
Properties of RD(a)
• So transforming a into bcan be seen as
transforming RD(a) into a graph with as
many cycles as possible - n + 1
• Now we need to see how a reversal
affects the cycles of RD(a)
Note that a reversal is characterized by the
two points where it cuts the current
permutation, which each correspond to a
reality edge
Reversals and RD(a)
• Let r be a reversal defined by two reality
edges (s,t) and (u,v), then RD(ar) differs
from RD(a) as follows:
Reality edges (s,t) and (u,v) are replaced by
(s,u) and (t,v)
Vertices u, …, t are reversed
• Desire edges remain unchanged
• See example on following slide
Example
L
Some nodes/edges omitted
R
L
R
Orientation of Cycles
• How many cycles are affected by a
reversal?
• First we define convergent and divergent
edges
Two reality edges on the same cycle
converge if they are traversed in the same
direction (clockwise or counterclockwise on
the circle in the diagram) on the cycle
Otherwise they diverge
Orientation of Cycles
Convergent: (+3,+2) (-1,-4)
Divergent: (L,-3) (+3,+2)
L
R
Reversals and #Cycles
• Let r be a reversal acting on two reality
edges e and f
If e and f belong to different cycles, c(ar) =
c(a) – 1
If e and f belong to the same cycle and
converge, c(ar) = c(a)
If e and f belong to the same cycle and
diverge, c(ar) = c(a) + 1
First Case
• If e and f belong to different cycles, c(ar) =
c(a) – 1
Second Case
If e and f belong to the same cycle and
converge, c(ar) = c(a)
Third Case
If e and f belong to the same cycle and
diverge, c(ar) = c(a) + 1
Reversals and #Cycles
• Note that the number of cycles changes by at
most one with each reversal
Use that to find another lower bound for reversal
distance
Suppose we have ar1r2…rt = bwe know that c(b)=n + 1 and
we have:
•
•
•
•
c(ar1) - c(a) 1
c(ar1r2) - c(ar1) 1
…
c(ar1r2…rt) - c(ar1r2…rt-1) 1
Adding and cancelling terms we get
• n + 1 - c(a) t
• If r1r2…rt is optimal then t = d(a), n + 1 - c(a) d(a)
Interleaving Graph
• This new lower bound is better than the
old one - b(a)/2
For most signed permutations, it is close to
the actual distance, however it does not
always work (we can’t always choose two
divergent edges)
• We can classify the cycles of RD(a) as
good or bad:
A cycle is good if it has two divergent reality
edges
Otherwise it is bad
Interleaving Graph
• The classification only applies to proper
cycles (those with at least four edges)
Those with three edges don’t need to be
touched since reality = desire
• If we have only good cycles in a
permutation, then the lower bound
previously given is an equality
We sort, increasing the number of cycles by
one per reversal
Interleaving Graph
• If a desire edge from one cycle crosses
some desire edge from another cycle we
say that the two cycles interleave
Interleaved cycles allow us to change a bad
cycle into a good one while breaking another
cycle
This good cycle can then broken in the next step
To find interleaving cycles, we construct an
interleaving graph
Interleaving Graph
Interleaving Graph
• Nodes in the interleaving graph are cycles
• Edge between two nodes if the cycles
interleave
• The connected components of the graph
are called bad components if they consist
entirely of bad cycles
• Component otherwise is a good
component
Interleaving Graph
• What is the interleaving graph of the
previous example?
• Suppose that F and C are good cycles.
Which components of the interleaving graph
are good and which are bad?
Sorting Good Components
• We need to choose two divergent edges in the
same cycle to define a reversal that increases
the number of cycles
• Example
• A reversal characterized by two divergent edges
of the same cycle is a sorting reversal if and only
if it does not lead to the creation of bad
components
Bad Components
• Using this criterion to sort all of the good
components, we must now sort the bad
ones
• Give a hierarchy of bad components
• We say a component B separates
components A and C if all chords in RD(a)
that link a terminal in A to a terminal in C
cross a desire edge of B
Diagram with no Good
Components
Bad Components
• Reversal through reality edges in different
components A and C will result in every
component B that separates A and C
being twisted
A bad component becomes good when
twisted
A good component can stay good or become
bad when twisted
So twist only when no good components
Hierarchy of Bad Components
• A hurdle is a bad component that does not
separate any other two bad components
If a bad component separates others, then it is a
nonhurdle
• A hurdle A protects a nonhurdle B when removal
of A would cause B to become a hurdle
B is protected by A when every time B separates two
bad components, A is one of them
Hierarchy of Bad Components
• A hurdle A is called a superhurdle if it
protects some other nonhurdle B
Otherwise it is called a simple hurdle
Bad Components
Nonhurdles
Simple hurdles
Hurdles
Super hurdles
Fortress
• A signed permutation a is called a fortress
iff RD(a) has an odd number of hurdles
and all of them are super hurdles
Reversal Distance
• The reversal distance of oriented
permutations is given by:
• d(a) = n + 1 - c(a) + h(a) + f(a)
c(a) - number of cycles (proper and non)
h(a) - number of hurdles
f(a) - a a fortress? (1 else 0)
n + 1 - c(a) good components and bad components
which become good during sort
h(a) - bad components require extra reversal
f(a) - extra reversal for fortress
Algorithm
• If we don’t have a good cycle we must use
either a reversal on two convergent edges
or a reversal on edges in different cycles
In first case, number of cycles is constant
In second case, number of cycles decreases
by one
Choose case one on a hurdle
Transforms bad component into good
Number of cycles remains constant
Algorithm
• Getting rid of a non-hurdle doesn’t change
the number of hurdles or fortress status,
so distance remains the same
• If we reverse a superhurdle, the nonhurdle
it protects becomes a hurdle so h remains
constant
• Call reversal on some cycle in a hurdle
hurdle cutting
Algorithm
• In order not to increase f(a), use hurdle
cutting only when h(a) is odd
• Using reversal on edges in two different
cycles increases c(a)
However d(a) will decrease if we can
decrease h(a) by two
Choose edges from two different hurdles - this is
called hurdle merging
• The two hurdles as well as any nonhurdle separating them
become good components
Algorithm
• We have to be careful that hurdle merging
doesn’t transform a nonhurdle into a
hurdle
A and B are called opposite hurdles when we
find the same number of hurdles walking the
circle clockwise from A to B as we do walking
counterclockwise
This can only happen if h(a) is even
Choosing opposite hurdles, we don’t turn a
nonhurdle into a hurdle
Algorithm
• To avoid creating a fortress where we
don’t have one, we choose the opposite
hurdles when they exist
• If h(a) is odd and we have a simple hurdle,
do hurdle cutting to avoid fortress
• If neither case if possible, we already have
a fortress so f(a) doesn’t increase with any
hurdle merging
Algorithm
Algorithm Sorting Reversal
input: distinct permutations a and b
output: a sorting reversal for a with target b
if there is a good component in RDb(a) then
pick 2 divergent edges e and f in this component,
making sure the corresponding reversal does not
create any bad components
return the reversal characterized by e and f
else if h(a) is even then
return merging of two opposite hurdles
else if h(a) is odd and there is a simple hurdle then
return a reversal cutting this hurdle
else // fortress
return merging of any two hurdles