Identifying conserved segments in rearranged and divergent genomes
Bob Mau, Aaron Darling, Nicole T. Perna
Presented by Aaron Darling
Comparing genomic architectures
Genome sequence and architecture comparison
can lead to insight about organismal
• Evolutionary forces
• Gene functions
• Phenotypes
Rearrangement, gene gain, loss, and duplication
obfuscate homology
Structure of the bacterial chromosome
[Figure: circular chromosome with the origin of replication and the terminus labeled]
Replication proceeds simultaneously on each “replichore”.
Breakpoints of inversions occur an equal distance from the origin to maintain replichore balance (Tillier and Collins 2000, Ajana et al. 2002).
We call such rearrangements “symmetric inversions”.
Replichore size difference > 20% is selected against (Guijo et al. 2001)
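A minimal sketch of the two ideas on this slide, replichore imbalance and symmetric inversions. The function names are hypothetical, positions are assumed to be integer coordinates on a circular chromosome, and the imbalance is assumed to be measured as a fraction of total chromosome length (the slide does not say what the 20% is relative to).

```python
def replichore_lengths(genome_len, origin, terminus):
    """Lengths of the two replichores on a circular chromosome."""
    clockwise = (terminus - origin) % genome_len
    return clockwise, genome_len - clockwise


def replichore_imbalance(genome_len, origin, terminus):
    """Size difference between replichores as a fraction of genome length (assumed definition)."""
    a, b = replichore_lengths(genome_len, origin, terminus)
    return abs(a - b) / genome_len


def signed_offset(genome_len, origin, pos):
    """Offset of pos from the origin, in (-L/2, L/2]; the sign says which side of the origin."""
    d = (pos - origin) % genome_len
    return d if d <= genome_len / 2 else d - genome_len


def is_symmetric_inversion(genome_len, origin, bp1, bp2, tol=0):
    """True if the breakpoints lie on opposite sides of the origin at (nearly) equal distance."""
    o1 = signed_offset(genome_len, origin, bp1)
    o2 = signed_offset(genome_len, origin, bp2)
    return o1 * o2 < 0 and abs(abs(o1) - abs(o2)) <= tol
```

For example, on a 1000-unit chromosome with the origin at 0, breakpoints at 100 and 900 are symmetric, while breakpoints at 100 and 800 are not.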
A dot plot: Each dot is a pairwise (or n-way) local alignment
Blue: Same strand
Red: Opposite strand
Goal: Identify local homologous (orthologous) segments
Tools for segmental homology detection
GRIMM-Synteny (Pevzner et al. 2003, Bourque et al. 2004)
- cluster markers within a fixed distance
FISH (Vision et al. 2003)
- find statistically over-represented clusters of markers within a fixed distance
LineUp (Hampson et al. 2003)
- find collinear runs of markers among pairs of genomes, allowing degeneracy
Some alignment tools: Shuffle-LAGAN (Brudno et al. 2003), Mauve (Darling et al. 2004)
Small segments separated by
lineage-specific regions may not
be detected by methods based
strictly on distance.
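To make the limitation concrete, here is a rough sketch of purely distance-based clustering. It is not the algorithm of any tool named above; `cluster_by_distance` and `max_gap` are hypothetical names, and each marker is assumed to be a pair of positions (one per genome).

```python
from typing import List, Tuple


def cluster_by_distance(markers: List[Tuple[int, int]], max_gap: int) -> List[List[int]]:
    """Single-linkage clustering: join two markers if they are within
    max_gap of each other in BOTH genomes. Returns lists of marker indices."""
    parent = list(range(len(markers)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (a1, b1) in enumerate(markers):
        for j in range(i + 1, len(markers)):
            a2, b2 = markers[j]
            if abs(a1 - a2) <= max_gap and abs(b1 - b2) <= max_gap:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(markers)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With markers `[(0, 0), (5, 5), (40, 12), (45, 17)]` and `max_gap=10`, the two collinear pairs end up in separate clusters because a 35-unit lineage-specific gap in the first genome exceeds the distance cutoff, even though marker order is conserved.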
Key idea: use a combination of
conserved marker order
(collinearity) and alignment score
Finding conserved regions:
A pseudo-Gibbs sampler method
Given: A set ℳ of M monotypic markers
Do: Assign a posterior probability that any marker m ∈ ℳ is part of a conserved region
Use MCMC methodology to sample the frequency of
each marker’s inclusion in high-scoring configurations.
Use frequency as an estimate of “posterior probability”
Define a configuration X as a vector of length M of
binary random variables:
e.g. X = ( X1, X2, …, XM )
A configuration value xj maps marker mj to either
signal (1) or noise (0)
e.g. x = (0,1,0,0,1,1,…,1,0)
There are 2^M possible configurations
Run a Markov chain of length N over configuration space:
(X^1, X^2, …, X^N)
Sample possible marker configurations
Start with a random initial configuration, THEN:
Select a marker, sample whether it should be a 0
or 1 based on the current configuration
\[
\mathrm{Score}(m_j \mid x) = \sum_{v=L}^{j-1} w_v x_v + w_j + \sum_{v=j+1}^{R} w_v x_v
\]
First sum: sum of scores for all collinear markers to the left
Middle term: score of marker j
Second sum: sum of scores for all collinear markers to the right
w_v is the score of marker v, x_v is the configuration value (0 or 1)
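The score can be sketched in a few lines. The slide does not define how collinearity is tested or how the bounds L and R are chosen, so in this sketch both are supplied by the caller: `collinear`, `left`, and `right` are hypothetical parameters, not part of the original method description.

```python
from typing import Callable, Sequence


def marker_score(
    j: int,
    x: Sequence[int],
    w: Sequence[float],
    collinear: Callable[[int, int], bool],
    left: int,
    right: int,
) -> float:
    """Score(m_j | x): w_j plus the scores of markers currently set to 1
    that are collinear with m_j, within indices left..right (inclusive)."""
    left_sum = sum(w[v] * x[v] for v in range(left, j) if collinear(v, j))
    right_sum = sum(w[v] * x[v] for v in range(j + 1, right + 1) if collinear(v, j))
    return left_sum + w[j] + right_sum
```

Note that w_j is always counted, regardless of the current value of x_j, while neighbors contribute only when their configuration value is 1.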
Transform LCB score to probability
The scale parameter c is used in tandem with the sigmoid to map a marker’s score to a probability:
\[
P(X_j^{n+1} = 1 \mid x^n) = \frac{e^{\mathrm{Score}(m_j \mid x^n)/c}}{1 + e^{\mathrm{Score}(m_j \mid x^n)/c}}
\]
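A small sketch of the score-to-probability transform, written in a numerically stable form. The function name `inclusion_probability` is an illustrative choice; the formula itself is the sigmoid of Score / c from the slide.

```python
import math


def inclusion_probability(score: float, c: float) -> float:
    """Sigmoid of score / c, i.e. e^(score/c) / (1 + e^(score/c)).

    The two branches are algebraically identical; splitting on the sign
    avoids overflow in math.exp for large |score / c|.
    """
    z = score / c
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

A larger c flattens the curve: the same score of 100 maps to a probability of about 0.99 with c = 20 but only about 0.79 with c = 75, so the sampler is less decisive about any single marker.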
Sample a new value for xj
Set xj to 1 with probability given by the marker’s
score transformation
First allow the chain a “burn-in” period, then
continue for many iterations.
The frequency, or “posterior probability”, of m_j is:
\[
\mathrm{p.p.}(m_j) = \frac{\#\text{ of samples with } x_j = 1}{\#\text{ of samples}}
\]
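Putting the steps together, the sampling loop might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the collinearity test is passed in as a hypothetical `collinear` predicate, the left and right bounds are taken to be a fixed `window` of marker indices, one randomly chosen marker is updated per iteration, and the whole configuration is recorded after every post-burn-in update.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pseudo_gibbs(w, collinear, c=20.0, n_iter=5000, burn_in=1000, window=5, seed=0):
    """Estimate each marker's inclusion frequency ("posterior probability").

    w         : marker scores
    collinear : collinear(v, j) -> bool (assumed, caller-supplied)
    window    : how many markers to each side count as neighbors (assumed)
    """
    rng = random.Random(seed)
    m = len(w)
    x = [rng.randint(0, 1) for _ in range(m)]  # random initial configuration
    counts = [0] * m
    kept = 0
    for it in range(n_iter):
        j = rng.randrange(m)  # select a marker
        lo, hi = max(0, j - window), min(m - 1, j + window)
        score = w[j] + sum(
            w[v] * x[v] for v in range(lo, hi + 1) if v != j and collinear(v, j)
        )
        # Set x_j to 1 with probability sigmoid(score / c).
        x[j] = 1 if rng.random() < _sigmoid(score / c) else 0
        if it >= burn_in:  # discard the burn-in period
            kept += 1
            for k in range(m):
                counts[k] += x[k]
    return [cnt / kept for cnt in counts]
```

In this sketch an isolated marker can only fall below a threshold of 0.5 if its own score is negative, so it assumes marker scores are centered such that weak markers score below zero; the slides do not state how scores are scaled.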
Our method assigns each marker a p.p.
Threshold γ separates signal from noise
Using γ = .5, the X pattern appears
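Applying the threshold can be sketched as follows. This simplified version assumes a marker counts as signal when its p.p. is at least γ and that a segment is a run of consecutive signal markers in one genome's marker order; it ignores collinearity breaks between neighboring signal markers. The name `call_segments` is hypothetical.

```python
def call_segments(pp, gamma=0.5):
    """Return (first_index, last_index) for each run of markers with p.p. >= gamma."""
    segments = []
    start = None
    for j, p in enumerate(pp):
        if p >= gamma:
            if start is None:
                start = j  # a new run of signal markers begins
        elif start is not None:
            segments.append((start, j - 1))  # the run ended at the previous marker
            start = None
    if start is not None:
        segments.append((start, len(pp) - 1))
    return segments
```

Changing γ changes which markers count as signal, and therefore how the runs are split or joined.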
Application to 4 divergent Streptococcus
Markers are reciprocal best blastp hits of ORFs among:
S. agalactiae
S. pyogenes
S. pneumoniae
S. mutans
S. pneumoniae
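For readers unfamiliar with reciprocal best hits, a pairwise version can be sketched as below. The input format (a dict from (query, subject) pairs to scores) and the ORF names in the example are made up for illustration; extending the idea to markers shared among four genomes is not shown.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best = {}
    for (query, subject), score in hits.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between ORFs of two genomes.
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 50.0, ("a2", "b2"): 120.0, ("a3", "b2"): 150.0}
b_vs_a = {("b1", "a1"): 190.0, ("b2", "a3"): 150.0, ("b2", "a2"): 120.0}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` is dropped because its best hit `b2` prefers `a3`, which is how the reciprocal requirement filters out likely paralogs.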
What is the distribution of segment sizes in Streptococci?
[Figure: four histograms of segment sizes. x-axis: Segment sizes (Markers per segment); y-axis: Number of LCBs. Panels: “Low resolution”, “Medium resolution”, “High-1 resolution”, “High-2 resolution”. Parameter settings shown: c = 20, γ = .50; c = 20, γ = .30; c = 30, γ = .45; c = 75, γ = .45. Total segments shown: 26, 32, 57, 72.]
As resolution increases, large segments are broken up by smaller segments
What was the ancestral genome
organization?
Try building inversion phylogeny by applying GRIMM
and MGR to the 57 high resolution segments
Failed: The suggested rearrangements do not
maintain replichore balance
Try using the 26 larger, low resolution segments
Surprise! A success:
Transforming S. agalactiae into S. pyogenes
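An inversion scenario is a chain of inversions applied to a signed ordering of segments. A single step can be sketched as below; this is the standard signed-reversal operation, not the GRIMM or MGR code, and `apply_inversion` is a hypothetical name.

```python
def apply_inversion(order, i, j):
    """Invert segments i..j (inclusive): reverse their order and flip each strand sign."""
    if not 0 <= i <= j < len(order):
        raise ValueError("need 0 <= i <= j < len(order)")
    return order[:i] + [-s for s in reversed(order[i:j + 1])] + order[j + 1:]


step = apply_inversion([1, 2, 3, 4, 5], 1, 3)
```

Applying the same inversion twice restores the original order. A scenario that respects replichore balance would additionally restrict which (i, j) pairs are allowed at each step.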
Conclusions
- The pseudo-Gibbs sampler method detects
collinear segments at a variety of scales
- It would be nice to have an inversion phylogeny
inference tool that accounts for replichore balance!
- Large segments in Streptococci appear to
rearrange by symmetric inversions
- Small segments? An open problem.
Future directions
Can a biologically relevant full joint probability
distribution be expressed over configurations?
- If so, then a true Gibbs sampler could be
employed
Problems:
- Some rearrangements occur with different
frequency (e.g. symmetric inversions about the
terminus vs. IS-mediated translocation)
- Distinguish rearrangement by horizontal transfer (H.T.), gene
duplication and subsequent loss, symmetric
inversion, etc.
Acknowledgements
Bob Mau – did most of this work
My Ph.D. advisers:
Nicole T. Perna and Mark Craven
Others who have contributed insight:
Jeremy Glasner, Fred Blattner, Eric Cabot
GEL@UW-Madison
Grant money: NIH Grant GM62994-02.
NLM Training Grant 5T15M007359-03 to A.E.D.