A Ruby in the Rubbish:
Using molecular data to look
for signatures of selection
Bruce Walsh, [email protected]
University of Arizona
Depts. of Ecology & Evolutionary Biology,
Molecular & Cellular Biology,
Epidemiology & Biostatistics
Search for Genes that experienced
artificial (and natural) selection
Akin in spirit to testing candidate genes
for association or using genome scans to find
In linkage studies: Use molecular markers
to look for marker-trait associations (phenotypes)
In tests for selection, use molecular markers
to look for patterns of selection (patterns
of within- and between-species variation)
Types of Genes that have experienced
Selection in Crop species
Domestication genes: Alleles fixed in the course
of the initial domestication
Diversification/Improvement genes: Alleles fixed
in the course of improvement following
Adaptation genes: Alleles in natural populations
responding to natural selection on environmental
conditions (candidates to transfer into elite
Searches for regions under selection complement
standard linkage-based approaches for QTL
detection (line-crosses, association mapping)
Using QTL approaches to find domestication genes
requires making crosses of wild progenitor x
Localizing adaptation genes to a particular environment via a
standard QTL cross is very difficult, as one would miss
potential pathways to adaptation by focusing only on candidate
phenotypes thought of by the investigator.
The general approaches for using sequence
data to search for signs of selection
Key: Use of features of variation at a marker
locus to test for departures from strict neutrality
• Tests based on pattern and amount of within-species polymorphism (departures from neutral
predictions). On-going or recent selection
• Tests based on polymorphism plus between
species divergence. On-going or recent selection
• Tests based on phylogenetic comparisons between
species. Historical selection (won’t discuss these further)
A quick review of the neutral theory
(expected patterns of variation under drift)
• Drift and the coalescence process (it's about time)
• Mutation-drift equilibrium (within-population
variation). Function of population size and
mutation rate. Expected variation = H = 4Nem
• Divergence between populations (between-population variation). Function of time and
mutation rate (but not population size), d = 2tm
Mutation-Drift Equilibrium (Single Loci)
Drift removes variation, while mutation
introduces it. Thus, an equilibrium amount
of genetic variance results
While alleles change over time, heterozygosity
remains roughly constant.
If Ne is the effective population size and
m the mutation rate, Crow & Kimura showed
the equilibrium heterozygosity is given by
$$H = \frac{4N_e m}{4N_e m + 1}$$
Thus, H depends only on the product of population size
and mutation rate. The parameter 4Nem is
a fundamental one in molecular evolution and
often denoted by q.
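As a rough numerical illustration of the Crow & Kimura result, the sketch below evaluates H = 4Nem/(4Nem + 1). The function name and the example values of Ne and m are hypothetical, chosen only to show that H depends on the product 4Nem and not on Ne or m separately.

```python
def equilibrium_heterozygosity(ne: float, m: float) -> float:
    """Mutation-drift equilibrium heterozygosity, H = q / (q + 1), with q = 4*Ne*m."""
    q = 4.0 * ne * m
    return q / (q + 1.0)


# Hypothetical values (not from the slides): two populations with the same
# product Ne*m have the same expected heterozygosity.
h_small_pop = equilibrium_heterozygosity(ne=1e4, m=2.5e-5)   # q = 1
h_large_pop = equilibrium_heterozygosity(ne=1e5, m=2.5e-6)   # q = 1
print(h_small_pop, h_large_pop)
```

When q is much smaller than 1, H is approximately q itself, which is the approximation H = 4Nem used elsewhere in these slides.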
A very powerful way of thinking about drift
is the Coalescent Process
Instead of following alleles, think in terms
As a consequence of drift, eventually all
current copies of alleles trace back to a
single ancestral lineage.
Hence, the current lineages coalesce as
one moves back in time
MRCA = most recent common ancestor
Coalescent theory provides an easy way
to see why 4Nem appears.
Expected number of
mutations = 2tm
For two random
sequences within a
population, t = 2Ne
giving 2tm = 4Nem
From coalescent theory, the expected
Time back to the MRCA is 2N generations
Hence, for two randomly-chosen sequences,
the expected number of mutations they
differ by is just
2mt = 2m(2N) = 4Nm
If 4Nm >> 1, two random sequences will
typically differ (and hence be heterozygotes)
If 4Nm << 1, two random sequences will
typically be identical (and hence be homozygotes)
The Coalescent for a Sample
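For k-th coalescent event (k lineages remaining), rate qk = k(k-1)/(4N), so the expected waiting time is 4N/[k(k-1)]
For a sample of five sequences: Mean total time = N (1/5+1/3+2/3+2) = 3.2N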
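The sum above can be checked with a few lines of code. This is a minimal sketch, assuming the standard result that with k lineages the coalescence rate is k(k-1)/(4N), so the expected waiting time is 4N/[k(k-1)]; the function names are hypothetical, and times are reported in units of N generations.

```python
from fractions import Fraction


def expected_coalescent_times(n: int) -> list:
    """Expected waiting times (in units of N generations) while
    k = n, n-1, ..., 2 lineages remain: 4 / (k * (k - 1)) each."""
    return [Fraction(4, k * (k - 1)) for k in range(n, 1, -1)]


def expected_time_to_mrca(n: int) -> Fraction:
    """Expected total time back to the MRCA of a sample of n sequences."""
    return sum(expected_coalescent_times(n), Fraction(0))


print(expected_coalescent_times(5))   # waiting times 1/5, 1/3, 2/3, 2
print(expected_time_to_mrca(5))       # 16/5, i.e. 3.2N
```

For n = 2 the same function returns 2, matching the 2N generations quoted for two random sequences.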
Divergence Between Populations
Mutation and drift also generate a between-line variance, i.e., a population divergence
As lines separate, the initial heterozygosity is
randomly partitioned, creating a between-line variance
More importantly, as new mutations arise in the
separated lines, some of these are fixed by
drift, and this drives a constant divergence
On average, for a population of size N,
2Nm mutations arise each generation
For any of these, their probability of fixation
is just U(1/[2N]) = 1/(2N)
Hence, the rate at which new mutations are
fixed within a line is just
(# new per generation)*Pr(fixation)
2Nm*1/(2N) = m
Hence, divergence d(t) after t generations is
just d(t) = mt
Independent of population size!
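The cancellation of N can be made concrete with a short sketch. The function names and the numerical values are hypothetical; the arithmetic is just the slide's (# new per generation) * Pr(fixation).

```python
def substitution_rate(n: float, m: float) -> float:
    """Rate at which new neutral mutations are fixed within a line."""
    new_mutations_per_generation = 2.0 * n * m   # 2N gene copies, each mutating at rate m
    fixation_probability = 1.0 / (2.0 * n)       # neutral allele starting as a single copy
    return new_mutations_per_generation * fixation_probability


def divergence(t: float, n: float, m: float) -> float:
    """Expected divergence within a line after t generations, d(t) = m * t."""
    return substitution_rate(n, m) * t


# Hypothetical values: very different population sizes, same mutation rate.
print(substitution_rate(n=1e2, m=1e-8))
print(substitution_rate(n=1e6, m=1e-8))
```

Both calls return the mutation rate itself (up to floating-point rounding), whatever value of N is supplied.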
Logic behind polymorphism-based tests
Key: Time to MRCA relative to drift
If a locus is under positive selection, more
recent MRCA (shorter coalescent)
If a locus is under balancing selection, older
MRCA relative to drift (deeper coalescent)
Shorter coalescent = lower levels of variation,
longer blocks of disequilibrium
Deeper coalescent = higher levels of variation,
shorter blocks of disequilibrium
Selective sweeps result in a local decrease
in Ne around the selective site
This results in a shorter time to MRCA and
a decrease in the amount of polymorphism
Note that this has no effect on the rate
of divergence of neutral sites, as this is
independent of Ne.
Conversely, balancing selection increases
the effective population size, increasing
the amount of polymorphism
A scan of levels of polymorphism can thus
suggest sites under selection
[Figure: scan of polymorphism levels. Labels: local region with reduced mutation rate; local region with elevated mutation rate]
Example: maize domestication gene tb1
in plant architecture
Wang et al. (1999) observed a significant decrease
in genetic variation in the 5’ NTR region of tb1,
suggesting a selective sweep influenced this
region. The sweep did not influence the coding region.
Wang et al (1999) Nature 398: 236.
Clark et al (2004) examined the 5’ tb1 region
in more detail, finding evidence for a
sweep influencing a region of 60 - 90 kb
Clark et al (2004) PNAS 101: 700.
Wang et al. and Clark et al. controlled for
the reduction in neutral polymorphisms being
due simply to reduced mutation rate by
using a close relative (teosinte) as a control.
The process of domestication itself is expected
to reduce variation genome-wide because of
the population bottleneck that is typically
induced during domestication. In maize, the
background level of polymorphism (genome wide)
is only about 75% of that of teosinte.
Estimating strength of selection
from size of sweep region
Kaplan, Hudson, and Langley (1989) showed that the
distance d at which a neutral site can be influenced by
a sweep is a function of the strength of selection s and
the recombination fraction c, with d = 0.01 s/c.
Hence, s = 100 · d · c
For tb1, s ≈ 0.05.
With s in hand, one can also estimate the expected
time for selection to fix the allele, which Wang et al.
estimated at 300 to 1000 years, indicating a fairly long
period of domestication.
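The Kaplan et al. relationship can be sketched as a pair of helper functions. The names are hypothetical, and the units are an assumption: d in base pairs and c as the recombination fraction per base pair. The value of c below is made up purely for illustration and is not the rate used by Wang et al. or Clark et al.

```python
def selection_coefficient(d: float, c: float) -> float:
    """Selection coefficient implied by a sweep of size d, from s = 100 * d * c."""
    return 100.0 * d * c


def sweep_distance(s: float, c: float) -> float:
    """Distance over which a sweep can influence a neutral site, d = 0.01 * s / c."""
    return 0.01 * s / c


# Hypothetical inputs: a 90 kb sweep region and an assumed per-bp recombination rate.
assumed_c = 5e-9
s_estimate = selection_coefficient(d=90_000, c=assumed_c)
print(s_estimate)
print(sweep_distance(s_estimate, assumed_c))   # recovers roughly 90,000 bp
```

Because s scales linearly with both d and c, any uncertainty in the local recombination rate carries through directly to the estimate of s.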
Example: Waxy gene in Rice (Olsen et al. 2006)
“Sticky” (glutinous) rice results from low amylose
levels, and is typical of temperate japonica varieties
A number of groups showed this is due to a splice
mutant in the Waxy gene. This is an example of an
improvement (as opposed to domestication) gene
Olsen et al. observed a region 250kb in size around
Waxy with a greatly reduced level of polymorphism
compared to control populations.
Using the Kaplan et al expression, this gives s = 4.6!
While the sweep around tb1 did not even influence
the coding region of that gene, the waxy sweep
covers 39 rice genes!
One evolutionary consequence of a sweep is that
the reduction in effective population size (that produces the
signal of a sweep) also reduces the efficiency of
selection on linked genes within the region
(the Hill-Robertson effect)
Deleterious alleles have a higher probability of fixation
Favorable alleles have a reduced probability of fixation
Accumulation of Deleterious
mutations in domesticated rice genomes?
Lu et al (2006) compared the genomes of Oryza sativa ssp. indica and
japonica with their ancestral relative O. rufipogon.
The Ka/Ks (ratio of the substitution rate of non-synonymous to
synonymous changes) was much higher for indica vs. japonica (0.498)
than for domesticated vs. wild rice (japonica vs. rufipogon, 0.259)
Lu et al suggest that roughly 25% of the amino acid differences
between indica and japonica are deleterious.
They suggest that excessive reductions in Ne due to selective sweeps covering much of the genome during selection for
domestication greatly reduced the efficiency of natural selection
in removing deleterious alleles.
Formal tests of selection
• Tajima’s D. Requires: single-locus,
within-population polymorphism data
• McDonald-Kreitman Test. Requires:
coding region, data from 2 species (within-population variation, btw species divergence)
• Hudson-Kreitman-Aguade (HKA) test.
Requires: at least two loci, data from 2 species
(within-population variation, btw species divergence)
• Allele frequency vs. LD tests. Requires: dense
marker scan around a single-locus using
Tests based on Within-Population Variation
These tend to compare different measures of variation
(such as number of alleles vs. pair-wise distances among
Two sequence evolution frameworks are typically used:
infinite alleles vs. infinite sites.
Both assume each new mutation generates a new (unique)
sequence. (such is not the case for STRs)
How do these frameworks differ?
Consider the following five sequences
A A G A C C 1
A A G G C C 2
Infinite alleles: Treat each different haplotype as a distinct allele
A A G A C C
A A G G C C
A A G G C A
Here, there are three alleles
Infinite sites model: Treat each site (base
position) separately. How many polymorphic
sites are there?
Here, 2 polymorphic sites
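The two ways of counting can be written out directly. This is a minimal sketch using only the three distinct haplotypes shown above (how many copies of each occur among the five sequences is not used); the function names are hypothetical.

```python
def count_alleles(sequences: list) -> int:
    """Infinite-alleles view: every distinct haplotype is one allele."""
    return len(set(sequences))


def count_segregating_sites(sequences: list) -> int:
    """Infinite-sites view: count base positions with more than one state."""
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)


haplotypes = ["AAGACC", "AAGGCC", "AAGGCA"]
print(count_alleles(haplotypes))            # 3 alleles
print(count_segregating_sites(haplotypes))  # 2 polymorphic sites
```

The same data therefore give different summaries under the two frameworks, which is why tests built on allele counts and tests built on site counts are not interchangeable.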
Two typical classes of departures are seen with
1: An excess of rare alleles, a deficiency of intermediate
frequency alleles (alleles younger than expected)
2: An excess of intermediate frequency alleles, a
deficiency of rare alleles (alleles older than expected)
Pattern 1 expected under a selective sweep, when
coalescent times are shorter than expected
Pattern 2 expected under balancing selection, when
coalescent times are longer than expected
Major Complication With
Demographic factors can also cause these
departures from neutral expectations!
Too many young alleles -> recent population expansion
Too many old alleles -> population substructure
Thus, there is a composite alternative hypothesis,
so that rejection of the null does not imply selection.
Rather, selection is just one option.
Can we overcome this problem?
It is important to, as only polymorphism-based tests can indicate on-going selection
Solution: demographic events should leave a
constant signature across the genome
Essentially, all loci experience common
Genome scan approach: look at a large number
of markers. These generate null distribution
(most not under selection), outliers = potentially
Summary Statistics for Infinite Sites Model
The key parameter is q = 4Nem
• S, number of segregating sites. $E(S) = a_n q$, where $a_n = \sum_{i=1}^{n-1} 1/i$
• k, average number of pairwise differences. $E(k) = q$
• h, number of singletons. $E(h) = q \, n/(n-1)$
These suggest the following three estimates for q:
$\hat{q}_S = S/a_n$; $\hat{q}_k = k$; $\hat{q}_h = h\,(n-1)/n$
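The three estimators follow from setting each statistic equal to its expectation, and can be sketched as below. The function name and the five-sequence sample are hypothetical (the copy number of each haplotype is an assumption made for illustration). Singletons are counted here as biallelic sites whose rarer base occurs exactly once, which is also an assumption about how h is scored.

```python
from collections import Counter
from itertools import combinations


def theta_estimates(sequences: list) -> dict:
    """Infinite-sites summary statistics and the three estimates of q = 4*Ne*m."""
    n = len(sequences)
    a_n = sum(1.0 / i for i in range(1, n))
    columns = list(zip(*sequences))
    # S: number of segregating (polymorphic) sites
    s = sum(1 for col in columns if len(set(col)) > 1)
    # k: average number of differences over all pairs of sequences
    pairs = list(combinations(sequences, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # h: number of singleton sites (rarer base present in exactly one sequence)
    h = 0
    for col in columns:
        counts = Counter(col)
        if len(counts) == 2 and min(counts.values()) == 1:
            h += 1
    return {"S": s, "k": k, "h": h,
            "q_S": s / a_n, "q_k": k, "q_h": h * (n - 1) / n}


# Hypothetical sample of five sequences built from three haplotypes.
sample = ["AAGACC", "AAGACC", "AAGGCC", "AAGGCC", "AAGGCA"]
print(theta_estimates(sample))
```

Under strict neutrality the three estimates should agree on average, so large discrepancies among them are the raw material for the tests that follow.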
Tajima’s D test
One of the first, and most popular, polymorphism
tests was Tajima’s D test (Tajima 1989)
D contrasts estimates of q based on S vs. k
$$D = \frac{\hat{q}_k - \hat{q}_S}{\sqrt{\alpha_D S + \beta_D S^2}}$$
Idea: For S we simply count sites, independent of
their frequencies. Hence, S is rather sensitive
to changes in the frequency of rare alleles.
On the other hand, k is a more frequency-weighted measure, and hence more sensitive
to changes in the frequency of intermediate-frequency alleles.
D < 0: too many rare alleles. Selective sweep
or population expansion. MRCA more recent than expected.
D > 0: too many intermediate-frequency alleles.
Balancing selection or population subdivision.
MRCA more ancient than expected.
D is a test of whether the amount of heterozygosity (k)
is consistent with the number of polymorphic sites (S)
Under selective sweeps/population expansion,
heterozygosity should be significantly less
than predicted from number of polymorphisms
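A minimal sketch of the D statistic from the summary numbers n, S and k is given below. The variance constants are the standard ones from Tajima (1989), usually written as e1 S + e2 S(S-1); they are rearranged here into the form αD S + βD S², with αD = e1 - e2 and βD = e2. The function name and the example inputs are hypothetical, and no significance test is attempted.

```python
import math


def tajimas_d(n: int, s: int, k: float) -> float:
    """Tajima's D from sample size n, segregating sites s, and mean
    pairwise differences k. Requires n >= 3 and s >= 1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    alpha_d = e1 - e2          # coefficient of S
    beta_d = e2                # coefficient of S^2
    return (k - s / a1) / math.sqrt(alpha_d * s + beta_d * s * s)


# Hypothetical samples of n = 6 sequences with S = 4 segregating sites.
print(tajimas_d(6, 4, 4 / 3))   # every variant a singleton: D is negative
print(tajimas_d(6, 4, 2.4))     # every variant at frequency 1/2: D is positive
```

The two calls use the same S and differ only in k, which shows how the sign of D tracks the frequency spectrum and not the raw number of polymorphic sites.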
Genome-Wide Polymorphism Tests
As mentioned, general problem with polymorphism
tests is that demographic signals can also give the same
pattern as selection.
Cavalli-Sforza (1966) was among the first to note that
demography affects all genomic locations (roughly)
equally, while the effects of selection are unique to
a particular locus
With the advent of very dense marker sets, we are
now seeing genome-wide scans over all markers.
Idea: Most are not under selection and hence reflect
the common demographic features. Outliers against this
pattern suggest selection.
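The outlier idea can be sketched as a simple ranking against the empirical genome-wide distribution. This is only an illustration under stated assumptions: the function name, the tail fraction, and the locus names and values are all hypothetical, and flagging the extreme tails identifies candidates for follow-up, not statistically confirmed targets of selection.

```python
def flag_outliers(stat_by_locus: dict, tail_fraction: float = 0.025) -> dict:
    """Rank loci by a test statistic (e.g. Tajima's D) and return the loci in
    the lower and upper tails of the empirical genome-wide distribution."""
    ranked = sorted(stat_by_locus, key=stat_by_locus.get)
    # Always report at least one locus per tail, even for small scans.
    n_tail = max(1, int(len(ranked) * tail_fraction))
    return {"low": ranked[:n_tail], "high": ranked[-n_tail:]}


# Hypothetical scan: 40 loci with made-up statistic values.
scan = {f"locus{i}": (i - 20) / 10.0 for i in range(40)}
scan["locus7"] = -4.0     # unusually negative: candidate sweep
scan["locus31"] = 3.5     # unusually positive: candidate balancing selection
print(flag_outliers(scan))
```

Because the bulk of loci define the null distribution, a shared demographic history shifts all values together and does not by itself create outliers.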
Logic behind Joint
Under the neutral theory, heterozygosity is a
function of q = 4Ne m, while divergence is
a function of mt
Joint Polymorphism-Divergence tests use these
two different expectations to look for concordance
with neutral results.
For example, under neutrality, levels of polymorphism
and divergence should be positively correlated.
Under neutrality, the ratio of polymorphism
to divergence at the i-th locus is just
$$\frac{H_i}{d_i} = \frac{4N_e m_i}{2 t m_i} = \frac{2N_e}{t}$$
Hence, for a series of neutral loci compared in the
same populations, this ratio should be very similar.
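A short sketch shows why the ratio is shared across neutral loci. It assumes the approximations quoted earlier in these slides, H = 4Nem for polymorphism and d = 2tm for divergence between two populations; the function name and all numerical values are hypothetical.

```python
def polymorphism_divergence_ratio(ne: float, t: float, m: float) -> float:
    """Neutral expectation for H / d at one locus with mutation rate m."""
    expected_polymorphism = 4.0 * ne * m   # H = 4*Ne*m
    expected_divergence = 2.0 * t * m      # d = 2*t*m
    return expected_polymorphism / expected_divergence


# Hypothetical loci with a 100-fold range of mutation rates.
for m in (1e-9, 1e-8, 1e-7):
    print(m, polymorphism_divergence_ratio(ne=1e4, t=1e6, m=m))
```

The locus-specific mutation rate cancels, so every neutral locus returns 2Ne/t; a locus whose observed ratio departs strongly from the others is the kind of signal the HKA test looks for.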
The very popular Hudson, Kreitman and
Aguade (1987), or HKA test, is based on this
idea, with one using a series of controlled
(neutral) loci to contrast with the locus of
Joint Polymorphism-Divergence Tests
One of the most straight-forward tests of selection
that jointly uses divergence and polymorphism data
was proposed by McDonald and Kreitman (1991)
Consider the replacement & synonymous sites in a single
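$$\frac{H_{syn}}{d_{syn}} = \frac{4N_e m_{syn}}{2 t m_{syn}} = \frac{2N_e}{t}, \qquad \frac{H_{rep}}{d_{rep}} = \frac{4N_e m_{rep}}{2 t m_{rep}} = \frac{2N_e}{t}$$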
Since these ratios have the same expected
value, the McDonald-Kreitman test proceeds
via a simple contingency table contrasting
polymorphism vs. divergence at replacement
vs. synonymous sites.
Key feature: The McDonald-Kreitman test
is NOT affected by demography
Example: McDonald & Kreitman looked at the ADH
(Alcohol dehydrogenase) loci in D. melanogaster &
24 fixed differences occur: 7 replacement, 17 synonymous
44 polymorphisms occur: 2 replacement, 42 synonymous, giving
Fisher’s exact test gives p =0.0073
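The contingency-table calculation can be reproduced with the standard library alone. This is a minimal sketch of a two-sided Fisher's exact test (summing the probabilities of all tables no more likely than the observed one); the function name is hypothetical, and the counts are the ADH numbers quoted above.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def table_probability(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

    p_observed = table_probability(a)
    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = table_probability(x)
        if p <= p_observed * (1.0 + 1e-9):   # small tolerance for float ties
            p_value += p
    return p_value


# Rows: fixed differences, polymorphisms. Columns: replacement, synonymous.
p = fisher_exact_two_sided(7, 17, 2, 42)
print(round(p, 4))
```

The excess of replacement changes among fixed differences (7 of 24) relative to polymorphisms (2 of 44) is what drives the small p value.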
Wang et al’s LDD Test
(Linkage Disequilibrium Decay)
One feature of a selective sweep is the presence of derived alleles
at high frequency. Under neutrality, older alleles
are at higher frequencies.
Sabeti et al (2002) note that under a sweep such high
frequency young alleles should (because of their recent
age) have much longer regions of LD than expected.
Wang et al (2006) proposed a Linkage Disequilibrium
Decay, or LDD, test, which looks for excessive LD for high-frequency alleles.
Wang et al. used this approach with 1.6 million human
SNPs, finding that 1.6% of the markers showed some
signatures of positive selection.
Simulation studies by Wang et al. showed that the
LDD test effectively distinguishes selection from
population bottlenecks and admixture.
All genome-based tests have an important caveat.
The large number of markers used are typically
generated by looking for polymorphisms in a very
small, and often not very ethnically-diverse, sample
Results in a strong ascertainment bias, for example,
an excess of intermediate-frequency markers
If such biases are not accounted for, they can
skew test results.
Caveats and Unanswered Questions
• Even if they have experienced very strong
selection, domestication genes may not leave
a strong signal at linked neutral markers.
Must be sufficient background variation for the
chance of a sweep being detected.
Hamblin et al. (2006) found that the genome-wide
background variation in Sorghum is too low to reliably
detect signatures of selection. Likely from extreme
bottleneck during domestication.
If the ancestral species itself had low variation, would
also be very difficult to detect selective sweeps.
• A more subtle complication results from the frequency
of favorable alleles at the start of the domestication
A typical adaptive selective sweep is generally
thought to occur following the introduction of a
single favorable new mutation. Hence, only one
founding haplotype at the time of selection.
Selection on domestication alleles is akin to a sudden
shift in the environment, with many of these alleles
pre-existing in the population before domestication
If the frequency of any such allele is > 0.05, multiple
haplotypes are likely present, resulting in considerable
variation around the selective site even after fixation,
and hence a very weak (if any) signal.
Hence, there is the very real possibility
that many important domestication genes
will not have left a detectable signature in
the pattern of linked neutral variation.
Optimal conditions for detecting selection
High levels of polymorphism at the start of
High effective levels of recombination give
a shorter window around the selective site
High levels of selfing reduce the effective
recombination rate (e.g., maize vs. rice)
Signatures of sweeps persist for roughly Ne
Domestication vs. improvement genes
• Domestication genes will leave a signal in all lines,
while improvement genes may leave a line-specific
Unresolved question: Is selection stronger
on domestication or improvement genes?
Domestication gene tb1: 90kb sweep, s = 0.05
Improvement gene Y1: 600kb sweep, s = 1.2
Linkage mapping vs. detection of selected loci
Linkage: Know the target phenotype
Selection: Don’t know the target phenotype
Both can suffer from low power and confounding
from demographic effects
Both can significantly benefit from high-density
genomic scans, but these are also not without