Transcript s - IMPBio
Comparative analysis of primate genomes:
where has natural selection gone ?
ACI-IMPBIO, 4-5 October 2007
Laurent Duret, Nicolas Galtier, Peter Arndt
What’s in our genome ?
• 3.1 × 10^9 bp
• Repeated sequences: ~50%
• 20,000-25,000 protein-coding genes
• Protein-coding regions : 1.2%
• Other functional elements in non-coding
regions: 4-10%
How to identify functional
elements ?
What makes chimps
different from us ?
30 × 10^6 point substitutions + indels +
duplications (copy number variations)
• What are the functional elements
responsible for adaptive
evolution ?
Genome annotation by comparative
genomics
• Basic principle :
– Functional element <=> constrained by natural
selection
– Detecting the hallmarks of selection in genomic
sequences
• Negative selection (conservation)
• Positive selection (adaptation)
Evolution : mutation, selection, drift
[Diagram: from premutation to substitution]
• Individual: base modification, replication error, deletion, insertion, ... = premutation -> DNA repair -> mutation
• Soma: no transmission to the offspring
• Germline: transmission to the offspring (polymorphism)
• Population (N): fixation (substitution) or loss of the allele
Evolution : mutation, selection, drift
Probability of fixation:
p = f(s, Ne)
s : relative impact on fitness
s = 0 : neutral mutation (random genetic drift)
s < 0 : disadvantageous mutation = negative (purifying) selection
s > 0 : advantageous mutation = positive (directional) selection
Ne : effective population size: stochastic effects of gamete sampling
are stronger in small populations
|Nes| < 1 : effectively neutral mutation
Demonstrate the action of selection =
reject the predictions of the neutral model
[Diagram: premutation (base modification, replication error, deletion, insertion, etc.) -> mutation (individual) -> polymorphism -> fixation in the population (Ne) = substitution]
Substitution rate = f(mutation rate, fixation probability)
|Nes| < 1 : substitution rate = mutation rate
Tracking natural selection ...
• Mutation rate: u
• Substitution rate: K
• Negative selection => K < u
• Neutral evolution => K = u
• Positive selection => K > u
How to estimate u ?
=> Use of neutral markers
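The neutral prediction K = u can be recovered with a standard one-line argument. The sketch below assumes a diploid population of N individuals (so 2N gene copies) and a per-copy, per-generation mutation rate u; the factor 2N is an assumption of this sketch and does not appear on the slides. Each generation 2Nu new mutations arise, and a neutral mutation, starting at frequency 1/(2N), is fixed with probability 1/(2N):

```latex
K \;=\; \underbrace{2Nu}_{\text{new mutations per generation}}
   \times
   \underbrace{\frac{1}{2N}}_{\text{fixation probability (neutral)}}
   \;=\; u
```

Population size cancels out, which is why a neutral marker gives an estimate of u.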
Tracking natural selection ...
• Synonymous substitution rate: Ks
• Non-synonymous substitution rate: Ka
• Hypothesis: synonymous sites evolve (nearly)
neutrally
Ks ~ u
• Negative selection => Ka < Ks
• Neutral evolution => Ka = Ks
• Positive selection => Ka > Ks
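The three cases above can be sketched as a small Python helper. This is only an illustration: the function name and the `tolerance` band around Ka/Ks = 1 are hypothetical, and a real analysis would rely on a statistical test rather than a fixed threshold.

```python
def classify_selection(ka: float, ks: float, tolerance: float = 0.1) -> str:
    """Classify the selective regime from Ka and Ks.

    `tolerance` is a hypothetical band around Ka/Ks = 1 that is treated
    as 'neutral'; it is an assumption of this sketch, not a standard value.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to compute Ka/Ks")
    ratio = ka / ks
    if ratio < 1 - tolerance:
        return "negative selection"
    if ratio > 1 + tolerance:
        return "positive selection"
    return "neutral evolution"


# Made-up values, for illustration only.
print(classify_selection(0.02, 0.25))
```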
Tracking natural selection ... is
not so easy
• Patterns of neutral substitution vary along
chromosomes
– Impact of molecular processes (replication,
DNA-repair, transcription, recombination, …)
– Genomic environment (susceptibility to
mutagens)
Mammalian genomic landscapes
[Figure: GC% (30-60%) along 1000 kb of chromosome 19 and of chromosome 21; sliding windows: 20 kb, step = 2 kb; scale bar: 100 kb]
• Large scale variations of base composition along
chromosomes (isochores)
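A GC% profile of this kind can be computed with a simple sliding window. The following Python sketch uses the window and step sizes quoted for the figure (20 kb, 2 kb) as defaults; the function names and the toy sequence are hypothetical and only serve to show the mechanics.

```python
def gc_percent(seq: str) -> float:
    """Return the GC content of `seq` in percent, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else float("nan")


def sliding_gc(seq: str, window: int = 20_000, step: int = 2_000):
    """Yield (window_start, GC%) for each full window along `seq`."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_percent(seq[start:start + window])


# Toy sequence (not real data): an AT-rich half followed by a GC-rich half.
toy = "AATT" * 5 + "GGCC" * 5
profile = list(sliding_gc(toy, window=8, step=4))
print(profile)
```

On a real chromosome the sequence would be read from a FASTA file and the defaults left unchanged.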
GC content variations affect both coding
and non-coding regions
3661 human genes from 1652 large genomic
sequences (> 50 kb; average = 134 kb).
Total = 221 Mb (98% non-coding)
What is the evolutionary process
responsible for these large-scale
variations in base composition ?
Variation in mutation patterns ?
• Analysis of polymorphism data: in GC-rich
regions, AT->GC mutations have a higher
probability of fixation than GC->AT
mutations (Eyre-Walker 1999; Duret et al. 2002; Spencer et al. 2006)
Selection ?
• What could be the selective advantage
conferred by a single AT->GC mutation in a
Mb-long genomic region ???
Biased Gene Conversion ?
Biased Gene Conversion (BGC)
Molecular events of meiotic recombination
[Figure: heteroduplex DNA carrying a T/G mismatch, formed in both crossover and non-crossover events; DNA mismatch repair resolves it either to T/A (G->A) or to C/G (T->C)]
If DNA mismatch repair is biased (i.e. probability of repair is not 50% in favor of
each base) => BGC
BGC: a neutral process that looks
like selection
• The dynamics of the fixation process for one locus
under BGC is identical to that under directional
selection (Nagylaki 1983)
• BGC intensity depends on:
– Recombination rate
– Bias in the repair of DNA mismatches
– Effective population size
• GC-alleles have a higher probability of fixation than
AT-alleles (Eyre-Walker 1999, Duret et al. 2002, Lercher et al. 2002,
Spencer et al. 2006)
• This fixation bias in favor of GC-alleles increases with
recombination rate (Spencer 2006)
Does BGC affect substitution
patterns ?
• BGC should affect the relative rates of AT->GC vs
GC->AT substitutions in regions of high
recombination
• Relationship between neutral substitution patterns
and recombination rate ?
Substitution patterns in the
hominidae lineage
• Human, chimp, macaca whole genome alignments:
– Genomicro: database of whole genome alignments
– 2700 Mb (introns and intergenic regions)
• Substitutions inferred by a maximum likelihood approach
(collaboration with Peter Arndt, Berlin)
• Substitution rates:
– 4 transversion rates: A->T; C->G; A->C; C->A
– 2 transition rates: A->G; G->A
– transitions at CpG sites: G->A
• Cross-over rate: HAPMAP
GC-content expected at equilibrium
(GC*)
• Equilibrium GC-content : the GC content that
sequences would reach if the pattern of substitution
remained constant over time = the future of GC-content
• Ratio of AT->GC over GC->AT substitution rates
(taking into account CpG hypermutability)
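As a rough illustration of how GC* follows from the two substitution rates, here is a Python sketch of a two-state (AT <-> GC) model. It is a simplification: it assumes constant per-site rates and ignores CpG hypermutability, which the analysis presented here does take into account. The function and parameter names are hypothetical.

```python
def equilibrium_gc(rate_at_to_gc: float, rate_gc_to_at: float) -> float:
    """GC* under a two-state (AT <-> GC) model with constant rates.

    At equilibrium the AT->GC flux equals the GC->AT flux:
        (1 - GC*) * rate_at_to_gc = GC* * rate_gc_to_at
    which gives GC* = rate_at_to_gc / (rate_at_to_gc + rate_gc_to_at).
    """
    total = rate_at_to_gc + rate_gc_to_at
    if total <= 0:
        raise ValueError("at least one rate must be positive")
    return rate_at_to_gc / total


# Made-up rates: GC->AT substitutions twice as frequent as AT->GC.
print(equilibrium_gc(1.0, 2.0))
```

Only the ratio of the two rates matters, so they can be expressed in any common unit.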
GC-content expected at equilibrium
and recombination
[Figure: equilibrium GC-content (GC*, 30-60%) vs. cross-over rate (0-9 cM/Mb); R2 = 36%, p < 0.0001; N = 2707 non-overlapping windows (1 Mb), from autosomes]
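For readers who want to reproduce this kind of statistic, the R2 of a one-predictor linear regression is the squared Pearson correlation between the two variables. The Python sketch below uses only the standard library; the values in the example are made up for illustration and are not the data behind the figure.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation = R2 of a one-predictor linear regression."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Made-up toy values (one entry per hypothetical 1 Mb window).
crossover_rate = [0.5, 1.0, 2.0, 3.0, 4.0]   # cM/Mb
gc_star = [0.33, 0.36, 0.35, 0.42, 0.41]     # equilibrium GC-content
print(round(r_squared(crossover_rate, gc_star), 2))
```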
GC-content and Recombination
• Strong correlation: suggests direct causal
relationship
• GC-rich sequences promote recombination ?
– Gerton et al. (2000), Petes & Merker (2002), Spencer et al. (2006)
• Recombination promotes AT->GC substitutions ?
GC-content and recombination
[Figure: present GC-content (40-70%) vs. cross-over rate (0-9 cM/Mb); N = 2707, R2 = 14%, p < 0.001]
Recombination and GC-content
Molecular events of meiotic recombination
• Recombination events:
crossover + non-crossover
• Genetic maps: crossover only
=> The correlation between GC* and crossover rate
might underestimate the real correlation between
GC* and recombination
Evolution of GC-content:
distance to telomeres
[Figure: equilibrium GC-content (GC*, 0.30-0.60) vs. distance to telomere (0.1-100 Mb, log scale); N = 2707, R2 = 41%, p < 0.0001]
GC* vs. crossover rate + distance to telomeres: R2 = 53%
BGC: a realistic model ?
• Recombination occurs predominantly in hotspots that
cover only 3% of the genome (Myers et al 2005)
• Recombination hotspots evolve rapidly (their location is
not conserved between human and chimp) (Ptak et al. 2005,
Winkler et al. 2005)
Can BGC affect the evolution of Mb-long isochores ?
BGC: a realistic model ?
• Probability of fixation of an AT-allele:
$q = \frac{1 - e^{2s}}{1 - e^{4Ns}}$
• Probability of fixation of a GC-allele:
$p = \frac{1 - e^{-2s}}{1 - e^{-4Ns}}$
• Effective population size N ~ 10,000
• s : BGC coefficient
– Recombination hotspots: s = 1.3 × 10^-4 (Spencer et al. 2006)
– No BGC outside hotspots: s = 0
• Hotspot density: 3% (on average), variations along
chromosomes (0.05% to 10.7%)
• Pattern of mutation: constant across chromosomes
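The model described above can be sketched numerically in Python. Several choices below are assumptions of this sketch rather than details given on the slides: fixation probabilities are averaged over sites in proportion to hotspot density, the sign conventions follow the standard diffusion formula for a new allele, and the `mutation_bias` default of 2.0 is a hypothetical placeholder for the (constant) mutation pattern.

```python
import math


def fixation_probability(s: float, n: float) -> float:
    """Fixation probability of a new allele with coefficient s, population size n.

    Uses (1 - exp(-2s)) / (1 - exp(-4ns)); the neutral limit is 1/(2n).
    """
    if s == 0:
        return 1.0 / (2.0 * n)
    return -math.expm1(-2.0 * s) / -math.expm1(-4.0 * n * s)


def predicted_gc_star(hotspot_density, s=1.3e-4, n=10_000, mutation_bias=2.0):
    """Equilibrium GC-content for a region with a given hotspot density.

    mutation_bias = (GC->AT mutation rate) / (AT->GC mutation rate);
    the default is a hypothetical placeholder, not a value from the slides.
    """
    neutral = fixation_probability(0.0, n)
    # Average over sites: BGC inside hotspots, pure drift outside.
    fix_gc = hotspot_density * fixation_probability(s, n) + (1 - hotspot_density) * neutral
    fix_at = hotspot_density * fixation_probability(-s, n) + (1 - hotspot_density) * neutral
    rate_at_to_gc = 1.0 * fix_gc            # AT->GC mutations create GC alleles
    rate_gc_to_at = mutation_bias * fix_at  # GC->AT mutations create AT alleles
    return rate_at_to_gc / (rate_at_to_gc + rate_gc_to_at)


for density in (0.0005, 0.03, 0.107):
    print(f"hotspot density {density:.2%}: GC* = {predicted_gc_star(density):.3f}")
```

With these assumptions GC* rises with hotspot density, which is the qualitative behaviour the model is meant to capture; the absolute values depend on the placeholder mutation bias.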
BGC: a realistic model ?
[Figure: equilibrium GC-content (GC*) vs. crossover rate (cM/Mb); observations compared with the predictions of the BGC model]
Summary (1)
• Recombination :
– Strong impact on patterns of substitutions
– drives the evolution of GC-content
• Most probably a consequence of BGC
– Mutation bias: does not explain the fixation bias favoring GC alleles
– Selection: does not explain the correlation with recombination rate
– BGC: all observations fit the predictions of the
model
BGC can affect functional regions
• Fxy gene : translocated in the
pseudoautosomal region (PAR) of the X
chromosome in Mus musculus
                       X specific      PAR
Recombination rate     normal          extreme
GC synonymous sites    normal (55%)    very high (90%)
Amino-acid substitutions in Fxy
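[Figure: phylogenetic tree of Homo, Rattus, M. spretus and M. musculus on a 0-80 Myr time scale, with the number of amino-acid substitutions per branch in the 5' part and the 3' part of Fxy; insets show the position of the gene relative to X, Y and the PAR]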
28 non-synonymous substitutions, all AT->GC
NB: strong negative selection (Ka/Ks < 0.1)
Amino-acid substitutions in Fxy
BGC can drive the fixation
of deleterious mutations
BGC: a neutral process that looks
like selection
• BGC can confound selection tests
HARs: human-accelerated regions
• Pollard et al. (Nature 2006; PLoS Genet. 2006) : searching
for positive selection in non-coding regulatory
elements
• Identify regulatory elements that have
significantly accelerated in the human lineage =
HARs
Positive selection in the human
lineage ?
• 49 significant HARs
• HAR1: 120 bp
– Rate of evolution >> neutral rate (18 fixed substitutions
in the human lineage, vs. 0.7 expected)
– Part of a non-coding RNA gene
– Expressed in the brain
– Involved in the evolution of human-specific brain
features ?
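To get a feel for how extreme 18 substitutions are against an expectation of 0.7, one can compute a Poisson tail probability. This is only a back-of-the-envelope sketch in Python: the Poisson assumption is ours and is not necessarily the test used by Pollard et al.; the function name is hypothetical.

```python
import math


def poisson_tail(k: int, lam: float, max_terms: int = 1000) -> float:
    """P(X >= k) for X ~ Poisson(lam).

    The upper tail is summed directly in log space, which avoids the
    cancellation that 1 - CDF suffers when the tail is tiny.
    """
    total = 0.0
    for i in range(k, k + max_terms):
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        if term < total * 1e-17:  # remaining terms are negligible
            break
    return total


# 18 observed substitutions vs. 0.7 expected under the neutral rate.
print(f"{poisson_tail(18, 0.7):.2e}")
```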
Positive selection ?
• GC-biased substitution pattern in HARs
– HAR1: the 18 substitutions are all AT->GC changes
– Known functional elements (coding or non-coding) are not
GC-rich
• HAR1-5: no evidence of selective sweep (Pollard et al.
2006)
• HAR1: the accelerated region covers >1 kb, i.e. is
not restricted to the functional element
Positive selection or BGC ?
• HARs are located in regions of high recombination
• Recombination occurs in hotspots (<2 kb)
• Given known parameters (population size, fixation bias), the
BGC model predicts substitution hotspots within
recombination hotspots
HARs = substitution hotspots caused by BGC in
recombination hotspots
Conclusion (1)
Recombination drives the evolution of
GC-content in mammals
GC-rich isochores = result of BGC in
highly recombining parts of the genome
Probably a universal process:
correlation GC / recombination in many
taxa (yeast, drosophila, nematode,
paramecia, …)
Conclusion (2)
BGC => substitution hotspots in
recombination hotspots
Recombination hotspots =
the Achilles’ heel of our genome
Conclusion (3)
Probability of fixation depends on:
- selection
- drift (population size)
- BGC
Extending the null hypothesis of
neutral evolution: mutation + BGC
Galtier & Duret (2007) Trends Genet
Thanks
• Vincent Lombard (Génomicro)
• Nicolas Galtier (Montpellier)
• Peter Arndt (Berlin)
• Katherine Pollard (UC Davis)
Sex-specific effects
• Correlation GC* / crossover rate (deCODE genetic map):
– male: R2 = 31%
– female: R2 = 15%
• The rate of cross-over is a poor predictor of the total
recombination rate in females: more variability in the ratio non-crossover / crossover along chromosomes ?
Chromosome size, recombination
and GC-content
[Figure: panels for human and chicken; axes: chromosome length (Mb), crossover rate (cM/Mb), GC* and current GC; reported R2 values: 0.84, 0.66, 0.82, 0.81]
Recombination and GC-content:
a universal relationship ?
G+C content vs. chromosome
length: yeast
R2= 61%
Bradnam et al. (1999) Mol Biol Evol
G+C content vs. chromosome
length: Paramecium
R2= 67%
[Figure: GC-content vs. chromosome size (kb)]
Evolution of GC-content
• Equilibrium GC-content correlates with ...
– Cross-over rate (HAPMAP): R2 = 36%
– Distance to telomere: R2 = 41%
– Cross-over rate + distance to telomeres: R2 = 53%
• Recombination pattern: ratio non-crossover / crossover
higher near telomeres ?
Frequency distribution of GC and AT alleles
[Figure: proportion of SNPs (0-0.6) in each allele-frequency class (<5%, 5%-15%, 15%-50%, >50%), for GC->AT and AT->GC mutations]
Distribution expected in absence of fixation bias
NB: the shape of the distribution may vary according
to population history, but should be identical for GC
and AT alleles
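The comparison described above amounts to binning derived-allele frequencies into the four classes of the figure, separately for GC->AT and AT->GC mutations, and comparing the two distributions. A minimal Python sketch follows; the treatment of values falling exactly on a class boundary is an assumption, and the frequencies in the example are made up.

```python
from collections import Counter

CLASSES = ("<5%", "5%-15%", "15%-50%", ">50%")


def frequency_class(freq: float) -> str:
    """Assign a derived-allele frequency (0-1) to one of the four classes.

    Boundary handling (e.g. exactly 50% goes to '15%-50%') is an assumption.
    """
    if freq < 0.05:
        return "<5%"
    if freq < 0.15:
        return "5%-15%"
    if freq <= 0.50:
        return "15%-50%"
    return ">50%"


def class_proportions(freqs):
    """Return the proportion of SNPs falling in each frequency class."""
    if not freqs:
        raise ValueError("need at least one SNP")
    counts = Counter(frequency_class(f) for f in freqs)
    return {label: counts.get(label, 0) / len(freqs) for label in CLASSES}


# Made-up derived-allele frequencies, for illustration only.
at_to_gc = [0.02, 0.10, 0.30, 0.60]
gc_to_at = [0.01, 0.03, 0.04, 0.20]
print(class_proportions(at_to_gc))
print(class_proportions(gc_to_at))
```

In the absence of a fixation bias the two dictionaries should be statistically indistinguishable, whatever their common shape.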
Frequency
distribution of AT
and GC alleles at
silent sites
• 410 SNPs with allele
frequency (Cargill et al 1999)
• Chimpanzee as an
outgroup to orient
mutations
• GC alleles segregate at
significantly higher
frequencies than AT alleles
in GC-median and GC-rich
genes
Duret et al. 2002
Frequency distribution of GC and
AT alleles
• Spencer (2006): analysis of HAPMAP
data (SNPs from 60 unrelated
individuals)
• The fixation bias in favor of GC
increases near recombination hotspots
Frequency distribution of GC
and AT alleles
[Figure: average derived allele frequency for AT->GC, AT->AT, GC->GC and GC->AT alleles]
Spencer (2006)