
INSERM Workshop – La Londe Les Maures – May 2004
DETECTING SELECTION FROM DNA SEQUENCE POLYMORPHISM DATA
N. GALTIER
CNRS UMR 5171 – Génome, Populations, Interactions, Adaptation
Université Montpellier 2, France
[email protected]
SEQUENCE POLYMORPHISM DATA
population (species)
sample: 5 genes
DNA fragment (locus), site
4 distinct sequences (haplotypes)
....ACGGATAGTTAGTGACGATA...
....ACGTATAGCTAGTGACGATA...
....ACGTATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATC...
       *    *          *
3 polymorphic (segregating) sites
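The two counts on this slide can be checked directly from the alignment. The sketch below is only an illustration: the function names are made up for this example, and the sequences are the five shown above with the flanking dots removed.

```python
# The five sampled sequences from the slide (flanking dots removed).
sample = [
    "ACGGATAGTTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATC",
]

def segregating_sites(seqs):
    """Return the 0-based positions where the sample carries more than one base."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def count_haplotypes(seqs):
    """Return the number of distinct sequences in the sample."""
    return len(set(seqs))

sites = segregating_sites(sample)
print(len(sites), "segregating sites at positions", [i + 1 for i in sites])
print(count_haplotypes(sample), "haplotypes")
```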
SEQUENCE POLYMORPHISM DATA
population (species)
sample: 5 genes
DNA fragment (locus)
....ACGGATAGTTAGTGACGATA...
....ACGTATAGCTAGTGACGATA...
....ACGTATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATA...
....ACGGATAGCTAGTGACGATC...
....CCAGCTAGCTACTGAAGTTG...   outgroup
MUTATIONS SEGREGATING IN A POPULATION (1)
[Figure: mutant allele frequency (0 to 1) against time, NEUTRAL case, with the sample marked]
Mutations (black dots) arise at rate 2Nμ
N: effective population size
μ: mutation rate
Under neutrality, a new mutation reaches fixation with probability 1/(2N)
This results in a neutral substitution rate of 2Nμ / 2N = μ (red dots)
The amount of polymorphism in the population at mutation-drift equilibrium
is determined by the product Nμ, usually measured as θ = 4Nμ
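The 1/2N fixation probability can be checked by simulation. The sketch below assumes a Wright-Fisher population (each generation is a binomial resampling of the 2N gene copies), which is one common model and not necessarily the one behind the figure; the function name, seed, and parameter values are illustrative choices.

```python
import random

def fixation_probability(two_n, replicates, rng):
    """Estimate the fixation probability of one new neutral mutant
    among two_n gene copies under Wright-Fisher resampling."""
    fixed = 0
    for _ in range(replicates):
        count = 1  # a single new mutant copy
        while 0 < count < two_n:
            freq = count / two_n
            # Next generation: two_n independent draws from the current gene pool.
            count = sum(rng.random() < freq for _ in range(two_n))
        fixed += count == two_n
    return fixed / replicates

rng = random.Random(2004)   # fixed seed so the run is repeatable
two_n = 20                  # small population to keep the run short
estimate = fixation_probability(two_n, 20000, rng)
print(f"simulated: {estimate:.4f}   expected 1/(2N): {1 / two_n:.4f}")
```

Multiplying this probability by the 2N × (mutation rate) new mutations per generation gives a substitution rate that no longer depends on N.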
MUTATIONS SEGREGATING IN A POPULATION (2)
[Figure: mutant allele frequency (0 to 1) against time; panels NEUTRAL and PURIFYING SELECTION]
Purifying (= negative) selection results in:
- a decreased substitution rate
- a decreased amount of polymorphism
- lower allele frequencies
MUTATIONS SEGREGATING IN A POPULATION (3)
[Figure: mutant allele frequency (0 to 1) against time; panels NEUTRAL and ADAPTIVE SELECTION]
Adaptive (= positive) selection results in:
- an increased substitution rate
- a decreased amount of polymorphism
- higher allele frequencies
LINKAGE AND HITCH-HIKING
[Figure: SELECTIVE SWEEP; a sampled neutral locus and a linked selected locus]
Directional selection decreases polymorphism at linked (neighboring) neutral sites
by increasing the apparent drift.
Recombination reduces the effect of selection at neighboring loci.
DETECTING SELECTION BY SEEKING REGIONS OF "LOW" POLYMORPHISM
Selection reduces polymorphism, but the level of polymorphism is determined
by other factors including population size and mutation rate.
To make sure that selection is acting, one must control for these nuisance factors.
Example: the sliding window strategy
[Figure: π (nucleotide diversity) plotted along the DNA fragment]
selection or reduced mutation rate?
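A sliding-window scan can be sketched as follows. This is a minimal illustration, assuming nucleotide diversity is computed as the average number of pairwise differences per site; the function names and the window and step sizes are arbitrary choices, and the toy alignment is the five-sequence sample from the earlier slide.

```python
from itertools import combinations

sample = [
    "ACGGATAGTTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATC",
]

def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / (len(pairs) * len(seqs[0]))

def sliding_pi(seqs, window, step):
    """Return (start, diversity) for each window along the alignment."""
    length = len(seqs[0])
    return [
        (start, nucleotide_diversity([s[start:start + window] for s in seqs]))
        for start in range(0, length - window + 1, step)
    ]

for start, value in sliding_pi(sample, window=10, step=5):
    print(f"sites {start + 1}-{start + 10}: {value:.3f}")
```

A real scan would use a much longer alignment; a window with low diversity is only a candidate, since the scan alone cannot separate selection from a locally low mutation rate.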
HITCH-HIKING MAPPING
[Table: amount of polymorphism at LOCI A to F (distinct μ's) in POPULATIONS 1 to 5 (distinct N's). The cell layout was lost in extraction; surviving values: 0.05, 0.07, 0.20, 0.13, 0.05, 0, 0.06, 0.10, 0.11, 0.03]
A selective sweep occurred at locus D in population 3
The low amount of polymorphism at locus D, pop 3 cannot be explained by:
- reduced population size (other loci show high polymorphism in pop 3)
- low mutation rate (other populations show high polymorphism at locus D)
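The reasoning above can be sketched numerically. If polymorphism scales with the product of a population effect (N) and a locus effect (mutation rate), each cell of the table is predicted by its row and column totals, and a sweep shows up as a cell far below that prediction. All numbers below are HYPOTHETICAL and are not the slide's data; the function name and the row-by-column expectation are assumptions made for this illustration.

```python
# HYPOTHETICAL population and locus effects (not the slide's values).
pop_effect = {"1": 1.0, "2": 2.0, "3": 3.0, "4": 1.5, "5": 1.0}
locus_effect = {"A": 0.05, "B": 0.03, "C": 0.04, "D": 0.06, "E": 0.02, "F": 0.05}

# Neutral table: each cell is the product of its population and locus effects.
observed = {(p, l): n * m
            for p, n in pop_effect.items()
            for l, m in locus_effect.items()}
observed[("3", "D")] = 0.0  # hypothetical sweep at locus D in population 3

def expected_from_margins(table):
    """Predict each cell as (row total x column total) / grand total."""
    pops = sorted({p for p, _ in table})
    loci = sorted({l for _, l in table})
    total = sum(table.values())
    row = {p: sum(table[p, l] for l in loci) for p in pops}
    col = {l: sum(table[p, l] for p in pops) for l in loci}
    return {(p, l): row[p] * col[l] / total for p in pops for l in loci}

expected = expected_from_margins(observed)
ratios = {cell: observed[cell] / expected[cell] for cell in observed}
outlier = min(ratios, key=ratios.get)
print("lowest observed/expected ratio at population %s, locus %s" % outlier)
```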
THE HKA TEST
[Figure: genealogies of the focal species and an outgroup at Locus A and at Locus B]
Selection has influenced polymorphism at one of the two loci.
The reduced amount of polymorphism at locus B cannot be explained by:
- reduced population size (locus A shows high polymorphism)
- low mutation rate (the distance to outgroup is not reduced)
THE McDONALD-KREITMAN TEST
[Figure: genealogy of the focal species sample and the outgroup]
                  polymorphic   fixed
synonymous             5          2
non-synonymous         4          8
The ratio of nonsynonymous to synonymous changes is higher between species (divergence: 8/2)
than within species (polymorphism: 4/5), whereas the two ratios should be equal under neutrality:
positive selection has promoted the fixation of nonsynonymous changes.
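The comparison of the two ratios can be written out as a small calculation. The sketch below uses the counts from the slide and, as an assumption about how the 2x2 table might be tested, a two-sided Fisher exact test; the function name and the use of the neutrality index (polymorphism ratio divided by divergence ratio) are choices made for this example.

```python
from math import comb

def mk_test(ps, ds, pn, dn):
    """ps, pn: synonymous and non-synonymous polymorphisms;
    ds, dn: synonymous and non-synonymous fixed differences."""
    neutrality_index = (pn / ps) / (dn / ds)
    row_syn = ps + ds            # all synonymous changes
    col_poly = ps + pn           # all polymorphic changes
    total = ps + ds + pn + dn

    def weight(a):
        # Number of tables with `a` synonymous polymorphisms, margins fixed.
        return comb(col_poly, a) * comb(total - col_poly, row_syn - a)

    low = max(0, row_syn - (total - col_poly))
    high = min(row_syn, col_poly)
    observed = weight(ps)
    # Two-sided p-value: tables at most as probable as the observed one.
    extreme = sum(weight(a) for a in range(low, high + 1) if weight(a) <= observed)
    return neutrality_index, extreme / comb(total, row_syn)

ni, p_value = mk_test(ps=5, ds=2, pn=4, dn=8)
print(f"neutrality index = {ni:.2f}, Fisher exact p = {p_value:.3f}")
```

For these counts the neutrality index is 0.2 and the exact p-value is about 0.17, so a table this small shows the direction of the effect rather than a significant result.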
COALESCENCE THEORY : FOCUSING ON SAMPLE GENEALOGY
[Figure: 2N chromosomes followed through time; time axis labelled 1, 2, 3, ..., k.N]
COALESCENCE THEORY : THE STANDARD COALESCENT
The genealogy of a sample of size n at a neutral locus in a panmictic population
of constant size 2N should be like:
[Figure: genealogy with coalescence intervals T5, T4, T3 and T2; depth labels: 4N (on average) and 2N (on average)]
where
- all topologies are equiprobable
- coalescence times Ti are independent exponential random variables
with expectation E(Ti) = 4N/(i(i-1)) generations
- mutations are superimposed onto the genealogy according to
a Poisson process
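The expectations above can be checked by drawing the coalescence times directly. This is a sketch of the timing part only (no topology, no mutations); the function name, seed, sample size, and N are illustrative values.

```python
import random

def coalescence_times(n, big_n, rng):
    """Draw T_n, ..., T_2 in generations for a sample of n genes.
    T_i is exponential with mean 4N / (i (i - 1))."""
    return {i: rng.expovariate(i * (i - 1) / (4 * big_n))
            for i in range(n, 1, -1)}

rng = random.Random(1)          # fixed seed so the run is repeatable
n, big_n, reps = 5, 1000, 20000
t2_mean = 0.0
height_mean = 0.0
for _ in range(reps):
    times = coalescence_times(n, big_n, rng)
    t2_mean += times[2] / reps
    height_mean += sum(times.values()) / reps

print(f"mean T2     : {t2_mean:.0f}  (expected 2N = {2 * big_n})")
print(f"mean height : {height_mean:.0f}  (expected 4N(1 - 1/n) = {4 * big_n * (1 - 1 / n):.0f})")
```

T2 alone accounts for more than half of the expected depth of the genealogy, which is why its variance dominates.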
THE COALESCENCE PROCESS HAS A HIGH VARIANCE
[Figure: distribution of T2]
Two realisations of the coalescent with equal Tn, Tn-1, …, T3, but distinct T2
DEPARTURE FROM NEUTRALITY : THE SELECTIVE SWEEP EXAMPLE
[Figure: a sampled neutral locus linked to a selected locus; SELECTIVE SWEEP marked on the genealogy]
neutral genealogy
"complete" selective sweep: star-like genealogy
"partial" selective sweep: partly star-like genealogy
DEPAULIS’ HAPLOTYPE TEST
neutral genealogy: 9 polymorphic sites, 8 haplotypes
"partial" selective sweep (partly star-like genealogy): 9 polymorphic sites, 3 haplotypes
A partially star-like genealogy results in a number of haplotypes lower than expected
given the number of polymorphic sites.
Other test statistics aiming at detecting non-neutral shapes of genealogy were proposed:
Tajima's D, Fu and Li's F, Fay and Wu's H, ...
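As an illustration of one of these statistics, here is a sketch of Tajima's D using its standard formula: the difference between the mean number of pairwise differences and the number of segregating sites scaled by a1, divided by an estimate of its standard deviation. The function name is made up for this example, and the toy data are the five sequences from the first slides. Judging significance would additionally require coalescent simulations or published critical values, which are not shown.

```python
from itertools import combinations
from math import sqrt

sample = [
    "ACGGATAGTTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGTATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATA",
    "ACGGATAGCTAGTGACGATC",
]

def tajimas_d(seqs):
    """Return (S, mean pairwise differences, Tajima's D) for aligned sequences."""
    n = len(seqs)
    s = sum(len(set(column)) > 1 for column in zip(*seqs))
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    return s, pi, d

s, pi, d = tajimas_d(sample)
print(f"S = {s}, mean pairwise differences = {pi:.2f}, D = {d:.3f}")
```

A star-like genealogy produces an excess of rare variants, which pushes D towards negative values.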
DEMOGRAPHY vs SELECTION
Detecting a departure from the standard coalescent means that at least one of its
assumptions is wrong. Neutrality, unfortunately, is only one of them.
Demographic effects (departure from the constant-population size assumption)
can distort genealogies in a way very similar to selection.
A bottleneck (sudden decrease of population size, followed by a restoration
of the former size), for example, has consequences highly similar to those of
a selective sweep.
To distinguish the two: multi-locus analysis.
Demography impacts the whole genome, while selection is locus-specific.
A LIKELIHOOD-BASED APPROACH
M1: neutral, constant size: p parameters (θ1, ..., θp)
M2: bottleneck: p+2 parameters (T, S, θ1, ..., θp)
M3: selective sweep: 3p parameters (T1, S1, θ1, ..., Tp, Sp, θp)
[Figure: timelines labelled T, T1, T2, T3]
Calculate and compare the likelihoods (probability of the data) under the three models
using likelihood ratio tests.
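The final comparison step can be sketched as follows, assuming the usual chi-square approximation for nested models (an assumption that may not hold in every case, for instance at parameter boundaries). The log-likelihood values are HYPOTHETICAL placeholders, the function names are made up for this example, and computing the likelihoods themselves is not shown. M2 has 2 more parameters than M1, so the test has 2 degrees of freedom; the closed form used here is valid for even degrees of freedom only.

```python
from math import exp

def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x), closed form for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total

def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return the test statistic 2 (lnL_alt - lnL_null) and its p-value."""
    stat = 2 * (lnl_alt - lnl_null)
    return stat, chi2_upper_tail_even_df(max(stat, 0.0), extra_params)

# HYPOTHETICAL log-likelihoods for M1 (neutral) and M2 (bottleneck).
lnl_m1, lnl_m2 = -1250.0, -1243.0
stat, p_value = likelihood_ratio_test(lnl_m1, lnl_m2, extra_params=2)
print(f"M1 vs M2: statistic = {stat:.1f}, p = {p_value:.2g}")
```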
WHAT I DID NOT TALK ABOUT
- subdivided populations, migration, isolation by distance, hybrid zones, clines
- other forms of selection (e.g. balancing selection)
- weak selection applying at many loci (e.g. codon usage)
- (biased) gene conversion
- patterns of linkage disequilibrium, coalescent with recombination
- microsatellites and other non-sequence genetic markers