Genome Rearrangements, Synteny, and Comparative Mapping

Download Report

Transcript Genome Rearrangements, Synteny, and Comparative Mapping

Genome Rearrangements,
Synteny, and
Comparative Mapping
CSCI 4830:
Algorithms for Molecular Biology
Debra S. Goldberg
Turnip vs Cabbage
• Share a recent common ancestor
• Look and taste different
Turnip vs Cabbage
• Comparing mtDNA gene sequences yields
no evolutionary information
• 99% similarity between genes
• These surprisingly identical gene
sequences differed in gene order
• This study helped pave the way to
analyzing genome rearrangements in
molecular evolution
Turnip vs Cabbage:
Different mtDNA Gene Order
• Gene order comparison:
Turnip vs Cabbage:
Different mtDNA Gene Order
• Gene order comparison:
Turnip vs Cabbage:
Different mtDNA Gene Order
• Gene order comparison:
Turnip vs Cabbage:
Different mtDNA Gene Order
• Gene order comparison:
Turnip vs Cabbage:
Gene Order Comparison
Before
After
• Evolution is manifested as the divergence
in gene order
Transforming
Cabbage into Turnip
Reversals
1
2
3
9
8
4
7
1, 2, 3, -8,
4, -7,
5, -6,
6, -5,
7, -4,
8, 9, 10


10
6
5
Blocks represent conserved genes.
In the course of evolution or in a clinical context, blocks 1,…,10
could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.
Types of Mutations
Types of Rearrangements
Inversion/Reversal
1 2 3 4 5 6
1 2 -5 -4 -3 6
Types of Rearrangements
Translocation
1 2 3
45 6
1 26
4 53
Types of Rearrangements
Fusion
1 2 3 4
5 6
1 2 3 4 5 6
Fission
Genome rearrangements
Mouse (X chrom.)
Unknown ancestor
~ 75 million years ago
Human (X chrom.)
• What are the similarity blocks and how to find
them?
• What is the architecture of the ancestral
genome?
• What is the evolutionary scenario for
transforming one genome into the other?
Why do we care?
SKY (spectral karyotyping)
Robertsonian Translocation
13
14
Robertsonian Translocation
13
14
• Translocation of chromosomes 13 and 14
• No net gain or loss of genetic material:
normal phenotype.
• Increased risk for an abnormal child or
spontaneous pregnancy loss
Philadelphia Chromosome
9
22
Philadelphia Chromosome
9
22
• A translocation between chromosomes 9
and 22 (part of 22 is attached to 9)
• Seen in about 90% of patients with
Chronic myelogenous leukemia (CML)
Colon cancer
Colon Cancer
Comparative maps
Waardenburg’s Syndrome:
Mouse Provides Insight into Human
Genetic Disorder
• Characterized by pigmentary dysphasia
• Gene implicated linked to human
chromosome 2
• It was not clear
where exactly on
chromosome 2
Waardenburg’s syndrome
and splotch mice
• A breed of mice (with splotch gene) had
similar symptoms caused by the same
type of gene as in humans
• Scientists identified location of gene
responsible for disorder in mice
• Finding the gene in mice gives clues to
where same gene is located in humans
Reversals: Example
p=12345678
12543678
Reversals: Example
p=12345678
12543678
12546378
Reversals and Gene Orders
• Gene order represented by permutation p:
p = p 1 ------ p i-1 p i p i+1 ------ p j-1 p j p j+1 ----- p n
r(i,j)
p 1 ------ p i-1 p j p j-1 ------ p i+1 p i p j+1 ----- pn

Reversal r ( i, j ) reverses (flips) the
elements from i to j in p
Reversal Distance Problem
• Goal: Given two permutations, find shortest
series of reversals to transform one into another
• Input: Permutations p and s
• Output: A series of reversals r1,…rt
transforming p into s, such that t is minimum
• t - reversal distance between p and s
• d(p, s) = smallest possible value of t, given p, s
Sorting By Reversals Problem
• Goal: Given a permutation, find a shortest series
of reversals that transforms it into the identity
permutation (1 2 … n )
• Input: Permutation p
• Output: A series of reversals r1, … rt
transforming p into the identity permutation such
that t is minimum
• min t =d(p ) = reversal distance of p
Sorting by reversals
Example: 5 steps
Step
Step
Step
Step
Step
Step
0: p 2 -4
1:
2 3
2:
2 3
3:
2 3
4:
-8 -7
5: g 1 2
-3
4
4
4
-6
3
5
5
5
5
-5
4
-8
-8
6
6
-4
5
-7
-7
7
7
-3
6
-6
-6
8
8
-2
7
1
1
1
-1
-1
8
Sorting by reversals
Example: 4 steps
Step
Step
Step
Step
Step
0: p 2 -4 -3
1:
2 3 4
2:
-5 -4 -3
3:
-5 -4 -3
4: g 1 2 3
5
5
-2
-2
4
-8
-8
-8
-1
5
-7
-7
-7
6
6
-6
-6
-6
7
7
1
1
1
8
8
What is the reversal distance for this
permutation? Can it be sorted in 3 steps?
Pancake Flipping Problem
• Chef prepares unordered
stack of pancakes of
different sizes
• The waiter wants to sort
(rearrange) them, smallest
on top, largest at bottom
• He does it by flipping over
several from the top,
repeating this as many
times as necessary
Christos Papadimitrou and
Bill Gates flip pancakes
Sorting By Reversals:
A Greedy Algorithm
• Unsigned permutations
• Example: permutation p = 1 2 3 6 4 5
• First three elements are already in order
• prefix(p) = length of already sorted prefix
– prefix(p) = 3
• Idea: increase prefix(p) at every step
Greedy Algorithm: An Example
• Doing so, p can be sorted
123645
123465
123456
• Number of steps to sort permutation of
length n is at most (n – 1)
Greedy Algorithm:
Pseudocode
SimpleReversalSort(p)
1 for i  1 to n – 1
2 j  position of element i in p (i.e., pj = i)
3 if j ≠i
4
p  p * r(i, j)
5
output p
6 if p is the identity permutation
7
return
Analyzing SimpleReversalSort
• SimpleReversalSort does not guarantee
the smallest number of reversals and
takes five steps on p = 6 1 2 3 4 5 :
•Step
•Step
•Step
•Step
•Step
1:
2:
3:
4:
5:
1
1
1
1
1
6
2
2
2
2
2
6
3
3
3
3
3
6
4
4
4
4
4
6
5
5
5
5
5
6
Analyzing SimpleReversalSort
(cont’d)
• But it can be sorted in two steps:
p = 612345
– Step 1: 5 4 3 2 1 6
– Step 2: 1 2 3 4 5 6
• So, SimpleReversalSort(p) is not optimal
• Optimal algorithms are unknown for many
problems; approximation algorithms used
Approximation Algorithms
• These algorithms find approximate
solutions rather than optimal solutions
• The approximation ratio of an algorithm A
on input p is:
A(p) / OPT(p)
where
A(p) -solution produced by algorithm A
OPT(p) - optimal solution of the problem
Approximation Ratio /
Performance Guarantee
• Approximation ratio (performance
guarantee) of algorithm A: max
approximation ratio of all inputs of size n
– For algorithm A that minimizes objective
function (minimization algorithm):
•max|p| = n A(p) / OPT(p)
– For maximization algorithm:
•min|p| = n A(p) / OPT(p)
Adjacencies and Breakpoints
p = p1p2p3…pn-1pn
• A pair of elements p i and p i + 1 are
adjacent if
pi+1 = pi + 1
• For example:
p=1 9 3 4 7 8 2 6 5
• (3, 4) or (7, 8) and (6,5) are adjacent pairs
Breakpoints: An Example
There is a breakpoint between any adjacent
element that are non-consecutive:
p=1 9 3 4 7 8 2 6 5
• Pairs (1,9), (9,3), (4,7), (8,2) and (2,5)
form breakpoints of permutation p
• b(p) - # breakpoints in permutation p
Adjacency & Breakpoints
•An adjacency – consecutive
•A breakpoint – not consecutive
Extend π with π0 = 0 and πn+1 = n+1
adjacencies
π=5 6 2 1 3 4
0 5 6 2 1 3 4 7
breakpoints
Extending Permutations
• Add p 0 =0 and p n + 1=n+1 at ends of p
Example:
p=1 9 3 4 7 8 2 6 5
Extending with 0 and 10
p = 0 1 9 3 4 7 8 2 6 5 10
Note: A new breakpoint was created after extending
Reversal Distance and
Breakpoints
 Each reversal eliminates at most 2 breakpoints.
p =2 3 1 4 6 5
0
0
0
0
2
1
1
1
3
3
2
2
1
2
3
3
4
4
4
4
6
6
6
5
5
5
5
6
7
7
7
7
b(p) = 5
b(p) = 4
b(p) = 2
b(p) = 0
This implies: reversal distance ≥ #breakpoints / 2
Sorting By Reversals:
A Better Greedy Algorithm
BreakPointReversalSort(p)
1 while b(p) > 0
2 Among all possible reversals,
choose reversal r minimizing b(p • r)
3 p  p • r(i, j)
4 output p
5 return
Problem: this algorithm may work forever
Strips
• Strip: an interval between two consecutive
breakpoints in a permutation
– Decreasing strip:
elements in decreasing order (e.g. 6 5)
– Increasing strip:
elements in increasing order (e.g. 7 8)
0 1 9 4 3 7 8 2 5 6 10
– Consider single-element strips decreasing
except strips 0 and n+1 are increasing
Reducing the Number of
Breakpoints
Theorem 1:
If permutation p contains at least one
decreasing strip,
then there exists a reversal r which
decreases the number of breakpoints
(i.e. b(p • r) < b(p) )
Things To Consider
• For p = 1 4 6 5 7 8 3 2
0 1 4 6 5 7 8 3 2 9
b(p) = 5
– Choose decreasing strip with the
smallest element k in p
(k = 2 in this case)
Things To Consider (cont’d)
• For p = 1 4 6 5 7 8 3 2
0 1 4 6 5 7 8 3 2 9
b(p) = 5
– Choose decreasing strip with the
smallest element k in p
(k = 2 in this case)
Things To Consider (cont’d)
• For p = 1 4 6 5 7 8 3 2
0 1 4 6 5 7 8 3 2 9
b(p) = 5
– Choose decreasing strip with the
smallest element k in p
(k = 2 in this case)
– Find k – 1 in the permutation
Things To Consider (cont’d)
• For p = 1 4 6 5 7 8 3 2
0 1 4 6 5 7 8 3 2 9
b(p) = 5
– Choose decreasing strip with the
smallest element k in p
(k = 2 in this case)
– Find k – 1 in the permutation
– Reverse segment between k and k-1:
0 1 2 3 8 7 5 6 4 9
b(p) = 4
Reducing the Number of
Breakpoints Again
– If there is no decreasing strip, there may
be no reversal r that reduces the number
of breakpoints (i.e. b(p • r) ≥ b(p) for any
reversal r).
– By reversing an increasing strip ( # of
breakpoints stay unchanged ), we will
create a decreasing strip at the next step.
Then the number of breakpoints will be
reduced in the next step (theorem 1).
Things To Consider (cont’d)
• There are no decreasing strips in p, for:
p =0 1 2 5 6 7 3 4 8
b(p) = 3
p •r(3,4) = 0 1 2 5 6 7 4 3 8
b(p) = 3
 r(3,4) does not change the # of breakpoints
 r(3,4) creates a decreasing strip thus
guaranteeing that the next step will
decrease the # of breakpoints.
ImprovedBreakpointReversalSort
ImprovedBreakpointReversalSort(p)
1 while b(p) > 0
2
if p has a decreasing strip
3
Among all possible reversals, choose reversal r
4
5
that minimizes b(p • r)
else
Choose a reversal r that flips an increasing
strip in p
6
p  p • r
7
output p
8 return
Performance Guarantee
• ImprovedBreakPointReversalSort is an
approximation algorithm with a performance
guarantee of at most 4
– It eliminates at least one breakpoint in every
two steps; at most 2b(p) steps
– Approximation ratio: 2b(p) / d(p)
– Optimal algorithm eliminates at most 2
breakpoints in every step: d(p)  b(p) / 2
– Performance guarantee:
( 2b(p) / d(p) )  [ 2b(p) / (b(p) / 2) ] = 4
Signed Permutations
• Up to this point, reversal sort algorithms
sorted unsigned permutations
• But genes have directions… so we should
consider signed permutations
5’
p =
3’
1
-2
-
3
4
-5
Signed Permutation
• Genes are directed fragments of DNA
• Genes in the same position but different
orientations do not have same gene order
• These two permutations are not equivalent
gene sequences
1
-1
2
2
3
4
5
-3
-4
-5
Signed permutations
are easier!
Polynomial time (optimal)
algorithm is known
Genome rearrangements
Mouse (X chrom.)
Unknown ancestor
~ 75 million years ago
Human (X chrom.)
• What are the similarity blocks and how to find
them?
• What is the architecture of the ancestral
genome?
• What is the evolutionary scenario for
transforming one genome into the other?
Genome rearrangements
Mouse (X chrom.)
Unknown ancestor
~ 75 million years ago
Human (X chrom.)
• What are the similarity blocks and how to find
them?
• What is the architecture of the ancestral
genome?
• What is the evolutionary scenario for
transforming one genome into the other?
Comparative maps
A brief history
• Chromosome comparisons
– no information about genes
• 1920’s: Sturtevant, Weinstein
• Today: many organisms, many uses
• Humans:
– primates, mouse, cat, dog, zebrafish, ...
– Alzheimer, cancers, diabetes, obesity, ...
Why construct comparative
maps?
• Identify & isolate genes
– Crops: drought resistance, yield, nutrition...
– Human: disease genes, drug response,…
• Infer ancestral relationships
• Discover principles of evolution
– Chromosome
– Gene family
• “key to understanding the human genome”
Map construction
Go from this
to this
3S
8L
Maize 1 (target),
Rice (base)
Wilson et al.
Genetics 1999
10L
3L
Why automate?
• Time consuming, laborious
– Needs to be redone frequently
• Codify a common set of principles
• Nadeau and Sankoff: warn of “arbitrary
nature of comparative map construction”
Input/Output
• Input:
– genetic maps of 2 species
– marker/gene correspondences (homologs)
• Output:
– a comparative map
• homeologies identified
A natural model?
Maize 1
Rice
3S
8L
10L
3L
Maize 1 (target),
Rice (base)
Wilson et al.
Genetics 1999
Scoring
10L
s
3L
bcd207a (10L)
cdo94b (10L)
bcd386a (10L)
isu78 (5L)
csu77 (10L)
cdo98b (10L)
rz630e (3L)
rz403 (3L)
cdo795a (3L)
isu92b (3L)
m
Assumptions
• Accept published marker order
• All linkage groups of base are unique
• Simplistic homeology criteria
• At least one homeologous region
A natural model?
Dynamic programming
• li = location of homolog to marker i
• S[i,a] = penalty (score) for an optimal
labeling of the submap from marker i to
the end, when labeling begins with label a
1 ...
a
i
...
n
Recurrence relation
• S[n,a] = m (a, ln)
...
...
a
n
ln
S[i,a] = m (a, li) + min (S[i+1,b] + s (a,b) )
bL
a b
... i i+1
li li+1
...
n
ln
Problem with
linear model
s=2
a-b-c motif:
a
a
a
b
a
b
b
b
a
b
score: 2s = 4
c
b
c
c
c
a-b-a motif:
a
a
a
score: 3m = 3
b
a
a
a
The stack model
d
c
a
b
c
e
f
b
b
• Segment at top of the stack can be:
– pushed (remembered), later popped
– replaced
• Push and replace cost s -- pop is free.
Scoring
7L
s
9L
“free”
pop
7L
uaz265a
isu136
isu151
rz509b
cdo59c
rz698c
bcd1087a
rz206b
bcd1088c
csu40
cdo786a
csu154
isu113a
csu17
cdo337
rz530a
(7L)
(2L)
(7L)
(7L)
(7L)
(9L)
(9L)
(9L)
(9L)
(3S)
(9L)
(7L)
(7L)
(7L)
(3L)
(7L)
m
m
m
Dynamic programming
• S[i,j,a] = score for an optimal labeling of:
– submap from marker i to marker j
– when labeling begins with label a -i.e., marker i is labeled a
1 ...
a
i
... j ... n
Recurrence relation
• S[i,i,a] = m (a, li)
• S[i,j,a] = min:
1 ...
a
i i+1
...
n
• m (a, li) + min (S[i+1,j,b] + s (a,b) )
bL
1 ...
a b
i i+1
...
• min S[i,k,a] + S[k+1,j,a]
i<k<j
1 ...
a
a
i ... k+1 ... j ... n
n
Advantage: output similar to
experts’
Maize 6 (target),
Rice (base)
Advantage: proposes testable
hypotheses
• New relations predicted;
greater resolution maps confirm
Maize 7 (target),
Rice (base)
Ahn-Tanksley ‘93
Ahn-Tanksley data
Wilson et. al. ‘99
Advantage: infers
evolutionary events
Wilson
et al.
Maize 1 (target)
Rice (base)
Stack
Problem:
Incomplete input
• Gene order not always fully resolved.
• Co-located genes can be ordered to give
most parsimonious labeling.
8p
19p
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
33.0
Atp6b1 (8p)
Comp (19)
Jak3 (19p)
Jund1 (19p)
Lpl (8p)
Mel (19p)
Npy1r (4q)
Pde4c (19)
Slc18a1 (8p)
Srebf1 (17p)
=
Atp6b1 (8p)
Lpl (8p)
Slc18a1 (8p)
Npy1r (4q)
Srebf1 (17p)
Comp (19)
Jak3 (19p)
Jund1 (19p)
Mel (19p)
Pde4c (19)
8p
19p
The reordering algorithm
• Uses a compression scheme
– Within a megalocus, group genes by location
of related gene.
– Order these groups
– First, last groups interact with nearby genes
– Any ordering of internal groups is equally
parsimonious
The reordering algorithm
The reordering algorithm
Definitions
 extended to distance to a set A of labels
(a, A) =
0
1
if a  A,
otherwise
li = set of labels matching markers in
megalocus i
S = set of megalocus start nodes
Definitions
• p(i,a,b) gives the total mismatched marker
and segment boundary penalties attributed
to “hidden markers”
– i is index for megalocus
– a and b are labels for megalocus ends
– Do any markers in megalocus match a, b?
• No: don’t penalize in recurrence and p(i,a,b)
Recurrence relation
S[i,i,a] = m (a, li)
1 ...
a
i i+1
...
n
S[i,j,a] = min:
m (a, li) + min (S[i+1,j,b] + s (a,b) + p(i,a,b))
bL
1 ...
a b
i i+1
...
n
min S[i,k,a] + S[k+1,j,a]
i<k<j
k S
1 ...
a
a
i ... k+1 ... j ... n
Results:
Fewer mismatches
stack
Mouse 5
(target)
Human
(base)
reordering
Results: Mismatches placed
between segments
stack
Mouse 8
(target)
Human
(base)
reordering
Results:
Detects new segments
stack
Mouse 13
(target)
Human
(base)
reordering
Summary
• Global view
• Finds optimal comparative map
– Arranges markers in most parsimonious way
• Biologically meaningful results
• Robust
– not species-specific
– high/low resolution, genetic/physical maps
– stable to errors in marker order