Transcript cis667-10

Genome Rearrangements
CIS 667 April 13, 2004
Genome Rearrangements
• We have seen how differences in genes at
the sequence level can be used to infer
evolutionary relations among species
 Differences in sequences in (one or more)
genes resulted from point mutations (insert,
delete, substitute)
 These are not the only type of changes that
can occur in the genome
Genome Rearrangements
• Repair of broken chromosomes is an important
process
 Mistakes can occur, however
• Mistakes can also occur during crossover
• These mistakes cause changes in gene order
 A large piece of chromosome can be moved or copied
to another location
 It can also move from one chromosome to another
 We call these movements genome rearrangments
Crossover
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Chromosome Repair
Genome Rearrangements
• These have important (usually fatal)
consequences for the organism and its evolution
• Alignments do not capture genome
rearrangments
 Two species may have nearly the same gene
sequences, but in a different order (why would the two
species then be different?)
Genome Rearrangements
• We need some other way to compare entire
genomes (i.e. compare at a higher level)
• Rather than simple point mutations a
genome is obtained from another by a
number of a special kind of
rearrangements: Reversals
 Use the number of reversals needed to
transform one genome into another to measure
evolutionary distance
The Method
• Use combinatorial optimization techniques
in an attempt to infer a most
economical sequence of rearrangement
operations to account for differences
among the genomes
 Compare with character-based methods for
phylogenetics (parsimony)
Reversals
• Consider the genome of a species as a
sequence of blocks
 A block is some sequence of the genome
(possibly containing more than one gene)
transcribed as a unit
 Blocks are oriented since they can be
transcribed from either strand of DNA
 Give homologous blocks the same label
Reversals
• Relation between chloroplast genomes of
alfalfa and garden pea:
Reversals
• Reversal operation for oriented blocks:
 Inverts the order of affected blocks and
changes their orientation (arrow)
 Affects a contiguous segment of blocks
• What sequence of reversal operations
could have changed alfalfa into garden
pea?
 Would like to have a polynomial time
algorithm to find the shortest sequence
Genome Comparison vs. Gene
Comparison
• In the late 1980s, J. Palmer and his
colleagues studied the mitochondrial
genomes of cabbage and turnips
 The gene sequences are very similar (some
genes are 99% equal)
 Gene order, however, differs dramatically
 Genome rearrangements are now considered
to be a common mode of molecular evolution
Genome Comparison vs. Gene
Comparison
• Extreme conservation of genes on X
chromosomes across mammalian species
provides an opportunity to study the evolutionary
history of X chromosome independently of the
rest of the genomes
• According to Ohno´s law, the gene content of X
chromosome has barely changed throughout
mammalian development in the last 125 million
years.
• However, the order of genes on X chromosomes
has been disrupted several times.
Human and Mouse X
Chromosomes
Human and Mouse X
Chromosomes
-4
-6
1
3
2
7
7
2
4
5
8
-3
6
4
5
8
6
4
5
8
4
5
8
2
1
2
7
1
2
7
3
1
2
-5
-4
-3
4
5
-3
3
8
6
7
2
5
-1
1
1
-3
-6
6
6
7
7
8
8
Genome Comparison vs. Gene
Comparison
• The traditional molecular evolutionary technique
is a gene comparison to construct a
phylogenetic tree
• In the ”cabbage and turnip” case this is hardly
suitable, since rate of point mutations in their
mitochondrial genes is so low that their genes
are almost identical
• Genome comparison (i.e. comparison of gene
orders) is the method of choice in the case of
very slowly evolving genomes
• Another area is the case where genomes evolve
very rapidly (genes not very similar)
Genome Comparison
• Only about (17839) genome rearrangements
have happened since human and mouse
diverged 80 million years ago
 Mouse and human genomes can be viewed as a
collection of about 200 fragments which are shuffled
in mice as compared to humans
 A comparative mouse-human genetic map gives the
position of a human gene given the location of a
related mouse gene
Man-Mouse Comparative
Physical Map
Definitions
• A signed permutation a over the set of labels L =
{1, 2, …, n} is a permutation such that a(i) = +a
or –a, where a L
• Example: +3, –2, –1 is a signed permutation
over L = {1, 2, 3}
 Note that no label may appear twice in the
permutation
• A reversal [i,j] is an operation that transforms
one signed permutation into another, reversing
the order or a contiguous portion and flipping the
signs
Definitions
• a’ = a[i,j] = a(1), …, a(i – 1), –a(j), …, –
a(i), a(j + 1), …, a(n)
• We are interested in the problem of sorting
by reversals: Given two signed
permutations a and b, find the minimum
number of reversals r1, …, rt that will
transform a into b- ar1…rt = b
• The reversal distance db(a) = t
Definitions
• Note that the reversal operation does not
directly correspond to the biological
operations (inversion, translocation,
fission, fusion)
• Given a and b, can we always transform a
into b using only the reversal operation? If
so, how many reversals are required in the
worst case?
Breakpoints
• A breakpoint is a point between
consecutive labels in the initial
permutation that must necessarily be
separated by at least one reversal to reach
the target permutation
 The two consecutive labels are not
consecutive in the target, or their orientations
are not the same in a relative sense
Breakpoints
• To formalize the idea of breakpoint, we introduce
the extended version of a
• Let a = a(1), …, a(n)
• Then the extended version of a is (L, a(1), …,
a(n), R)
• For example let extended a be (L, –2, –3, +1,
+6, –5, –4, R) and let extended b be (L, +1, +2,
+3, +4, +5, +6, R)
• The breakpoints are: (L,–2), (–2,–3), (–3,+1),
(+1,+6), (6,–5), (–4,R)
Breakpoints
• The number of breakpoints of a
permutation ais denoted by b(a)
 In the example, = 6
• Can you characterize the situations where
L is involved in a breakpoint? When R is
involved in a breakpoint?
A Lower Bound
• A reversal can remove at most two
breakpoints
 Cuts the permutation in exactly two places
 So, if ar1 … rt =bthen
 b(a) – b(ar1)  2
 b(ar1) – b(ar1r2)  2
…
 b(ar1…rt-1) – b(ar1…rt)  2
 So b(a)  2t. If t = d(a), b(a)/2  d(a)
Reality and Desire Diagram
• The lower bound found is not very tight
• We can derive a better l.b. based on a structure
called the reality-desire diagram of a
permutation with respect to another
• To draw the diagram, we will represent +a with
the tuple (-a +a) and -a with the tuple (+a -a)
 The orientation is given by the rightmost member of
the tuple
Reality and Desire Diagram
• A permutation is a sequence of adjacent
tuples:
a=+3, –2, –1, +4, –5 can be represented as:
L---(–3 +3)---(+2 –2)---(+1 –1)---(–4 +4)---(+5
–5)---R
b=L---(–1 +1)---(–2 +2)---(–3 +3)---(–4 +4)---(–
5 +5)---R
Reality and Desire Diagram
• Now we will draw a graph to represent a
(L, +3, -2, -1, +4, -5, R)
 The reality diagram:
L
-3 +3
+2 -2
+1 -1
-4 +4
+5 -5
R
Reality and Desire Diagram
• Suppose that b is the identity (L, +1, +2,
+3, +4, +5)
 We will add desire edges to the previous
graph to represent b
L
-3 +3
+2 -2
+1 -1
-4 +4
+5 -5
R
Reality and Desire Diagram
• a is the reality
• b is what is desired
• The diagram (a multigraph) shows both
reality and desire
 Call it RD(a)
• We can rearrange the nodes of the graph
to make it easier to understand
Reality and Desire Diagram
L
R
Desire
Reality
Properties of RD(a)
• Each vertex has degree 2
 Each node is incident to one edge from A, the
set of reality edges, and B, the set of desire
edges
• The connected components of the graph
are alternating cycles (edges alternate
between reality - blue - and desire - red)
• Each cycle has an even number of edges,
half reality and half desire
Properties of RD(a)
• The number of cycles of RD(a)is denoted
by cb(a)
• Note that cb(b) = n + 1 since b has no
breakpoints
 All cycles are two parallel edges between the
same pair of nodes
 We have 2n + 2 nodes, so n + 1 cycles
 This is the only permutation for which cb(a) =
1
Properties of RD(a)
• So transforming a into bcan be seen as
transforming RD(a) into a graph with as
many cycles as possible - n + 1
• Now we need to see how a reversal
affects the cycles of RD(a)
 Note that a reversal is characterized by the
two points where it cuts the current
permutation, which each correspond to a
reality edge
Reversals and RD(a)
• Let r be a reversal defined by two reality
edges (s,t) and (u,v), then RD(ar) differs
from RD(a) as follows:
 Reality edges (s,t) and (u,v) are replaced by
(s,u) and (t,v)
 Vertices u, …, t are reversed
• Desire edges remain unchanged
• See example on following slide
Example
L
Some nodes/edges omitted
R
L
R
Orientation of Cycles
• How many cycles are affected by a
reversal?
• First we define convergent and divergent
edges
 Two reality edges on the same cycle
converge if they are traversed in the same
direction (clockwise or counterclockwise on
the circle in the diagram) on the cycle
 Otherwise they diverge
Orientation of Cycles
Convergent: (+3,+2) (-1,-4)
Divergent: (L,-3) (+3,+2)
L
R
Reversals and #Cycles
• Let r be a reversal acting on two reality
edges e and f
 If e and f belong to different cycles, c(ar) =
c(a) – 1
 If e and f belong to the same cycle and
converge, c(ar) = c(a)
 If e and f belong to the same cycle and
diverge, c(ar) = c(a) + 1
First Case
• If e and f belong to different cycles, c(ar) =
c(a) – 1
Second Case
 If e and f belong to the same cycle and
converge, c(ar) = c(a)
Third Case
 If e and f belong to the same cycle and
diverge, c(ar) = c(a) + 1
Reversals and #Cycles
• Note that the number of cycles changes by at
most one with each reversal
 Use that to find another lower bound for reversal
distance
 Suppose we have ar1r2…rt = bwe know that c(b)=n + 1 and
we have:
•
•
•
•
c(ar1) - c(a)  1
c(ar1r2) - c(ar1)  1
…
c(ar1r2…rt) - c(ar1r2…rt-1)  1
 Adding and cancelling terms we get
• n + 1 - c(a)  t
• If r1r2…rt is optimal then t = d(a), n + 1 - c(a)  d(a)
Interleaving Graph
• This new lower bound is better than the
old one - b(a)/2
 For most signed permutations, it is close to
the actual distance, however it does not
always work (we can’t always choose two
divergent edges)
• We can classify the cycles of RD(a) as
good or bad:
 A cycle is good if it has two divergent reality
edges
 Otherwise it is bad
Interleaving Graph
• The classification only applies to proper
cycles (those with at least four edges)
 Those with three edges don’t need to be
touched since reality = desire
• If we have only good cycles in a
permutation, then the lower bound
previously given is an equality
 We sort, increasing the number of cycles by
one per reversal
Interleaving Graph
• If a desire edge from one cycle crosses
some desire edge from another cycle we
say that the two cycles interleave
 Interleaved cycles allow us to change a bad
cycle into a good one while breaking another
cycle
 This good cycle can then broken in the next step
 To find interleaving cycles, we construct an
interleaving graph
Interleaving Graph
Interleaving Graph
• Nodes in the interleaving graph are cycles
• Edge between two nodes if the cycles
interleave
• The connected components of the graph
are called bad components if they consist
entirely of bad cycles
• Component otherwise is a good
component
Interleaving Graph
• What is the interleaving graph of the
previous example?
• Suppose that F and C are good cycles.
 Which components of the interleaving graph
are good and which are bad?
Sorting Good Components
• We need to choose two divergent edges in the
same cycle to define a reversal that increases
the number of cycles
• Example
• A reversal characterized by two divergent edges
of the same cycle is a sorting reversal if and only
if it does not lead to the creation of bad
components
Bad Components
• Using this criterion to sort all of the good
components, we must now sort the bad
ones
• Give a hierarchy of bad components
• We say a component B separates
components A and C if all chords in RD(a)
that link a terminal in A to a terminal in C
cross a desire edge of B
Diagram with no Good
Components
Bad Components
• Reversal through reality edges in different
components A and C will result in every
component B that separates A and C
being twisted
 A bad component becomes good when
twisted
 A good component can stay good or become
bad when twisted
 So twist only when no good components
Hierarchy of Bad Components
• A hurdle is a bad component that does not
separate any other two bad components
 If a bad component separates others, then it is a
nonhurdle
• A hurdle A protects a nonhurdle B when removal
of A would cause B to become a hurdle
 B is protected by A when every time B separates two
bad components, A is one of them
Hierarchy of Bad Components
• A hurdle A is called a superhurdle if it
protects some other nonhurdle B
 Otherwise it is called a simple hurdle
Bad Components
Nonhurdles
Simple hurdles
Hurdles
Super hurdles
Fortress
• A signed permutation a is called a fortress
iff RD(a) has an odd number of hurdles
and all of them are super hurdles
Reversal Distance
• The reversal distance of oriented
permutations is given by:
• d(a) = n + 1 - c(a) + h(a) + f(a)
 c(a) - number of cycles (proper and non)
 h(a) - number of hurdles
 f(a) - a a fortress? (1 else 0)
 n + 1 - c(a) good components and bad components
which become good during sort
 h(a) - bad components require extra reversal
 f(a) - extra reversal for fortress
Algorithm
• If we don’t have a good cycle we must use
either a reversal on two convergent edges
or a reversal on edges in different cycles
 In first case, number of cycles is constant
 In second case, number of cycles decreases
by one
 Choose case one on a hurdle
 Transforms bad component into good
 Number of cycles remains constant
Algorithm
• Getting rid of a non-hurdle doesn’t change
the number of hurdles or fortress status,
so distance remains the same
• If we reverse a superhurdle, the nonhurdle
it protects becomes a hurdle so h remains
constant
• Call reversal on some cycle in a hurdle
hurdle cutting
Algorithm
• In order not to increase f(a), use hurdle
cutting only when h(a) is odd
• Using reversal on edges in two different
cycles increases c(a)
 However d(a) will decrease if we can
decrease h(a) by two
 Choose edges from two different hurdles - this is
called hurdle merging
• The two hurdles as well as any nonhurdle separating them
become good components
Algorithm
• We have to be careful that hurdle merging
doesn’t transform a nonhurdle into a
hurdle
 A and B are called opposite hurdles when we
find the same number of hurdles walking the
circle clockwise from A to B as we do walking
counterclockwise
 This can only happen if h(a) is even
 Choosing opposite hurdles, we don’t turn a
nonhurdle into a hurdle
Algorithm
• To avoid creating a fortress where we
don’t have one, we choose the opposite
hurdles when they exist
• If h(a) is odd and we have a simple hurdle,
do hurdle cutting to avoid fortress
• If neither case if possible, we already have
a fortress so f(a) doesn’t increase with any
hurdle merging
Algorithm
Algorithm Sorting Reversal
input: distinct permutations a and b
output: a sorting reversal for a with target b
if there is a good component in RDb(a) then
pick 2 divergent edges e and f in this component,
making sure the corresponding reversal does not
create any bad components
return the reversal characterized by e and f
else if h(a) is even then
return merging of two opposite hurdles
else if h(a) is odd and there is a simple hurdle then
return a reversal cutting this hurdle
else // fortress
return merging of any two hurdles