Genome evolution: a sequence
Genome Evolution © Amos Tanay, The Weizmann Institute
Genome evolution
Lecture 2:
population genetics I: drift and mutation
Studying Populations
Models:
A set of individuals, genomes
Ancestry relations or hierarchies
mtDNA human migration patterns
Experiments:
Field studies, diversity/genotyping
Experimental evolution
Åland Islands, Glanville fritillary population
Population genetics
Drift: The process by which allele frequencies are changing through
generations
Mutation: The process by which new alleles are being introduced
Recombination: the process by which multi-allelic genomes are mixed
Selection: the effect of fitness on the dynamics of allele drift
Epistasis: the drift effects of fitness dependencies among different
alleles
“Organismal” effects: Ecology, Geography, Behavior
The Hardy-Weinberg Model
• Diploid organisms
  Two copies of each allele/gene/base
  Homozygous / Heterozygous
• Sexual reproduction
  Mating haplotypes
• Large population, no migration
  Fixed size, closed system
• Non-overlapping generations
  Synchronous process
  Not as bad as it may look
• Random mating
  The new generation is selected from the existing haplotypes with replacement
• No mutations, no selection (will add these later)
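Hardy-Weinberg equilibrium:
[Diagram: genotypes AA, Aa, aA, aa in one generation; random mating with non-overlapping generations produces genotypes AA, Aa, aA, aa in the next generation]
$P(A) = p, \quad P(a) = q$
$P(AA) = p^2, \quad P(Aa) = 2pq, \quad P(aa) = q^2$
Under the model assumptions, equilibrium is reached within one generation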
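The one-generation claim can be checked numerically. The following is a minimal sketch, assuming genotype frequencies are stored in a dictionary; the function name and the starting frequencies are illustrative and not part of the slides.

```python
def next_generation(geno):
    """Genotype frequencies after one round of random mating.

    geno: dict with frequencies for 'AA', 'Aa' (both heterozygote orders) and 'aa'.
    """
    p = geno["AA"] + 0.5 * geno["Aa"]  # frequency of allele A among gametes
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

# Hypothetical starting population with no heterozygotes at all.
start = {"AA": 0.5, "Aa": 0.0, "aa": 0.5}
gen1 = next_generation(start)
gen2 = next_generation(gen1)
print(gen1)
print(gen2)
```

The second call returns the same frequencies as the first, since allele frequencies do not change under the model.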
Frequency estimates
We will be dealing with estimation of allele frequencies.
To remind you, when sampling n times from a population in which an allele has
frequency p, the number of times the allele is observed is a binomial
variable. This can be further approximated using a normal distribution:
$P(B(p; n) = i) = \binom{n}{i} p^i (1-p)^{n-i}$
$B(p; n) \approx N(np,\ np(1-p))$
When estimating the frequency from the number of successes, we
therefore have an error that looks like:
$s = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
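As a small illustration of the error term, the sketch below computes the frequency estimate and its standard error. The counts are hypothetical and the function name is an assumption made for this example.

```python
import math

def frequency_estimate(successes, n):
    """Return the estimated allele frequency and its standard error."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

# Hypothetical sample: the allele is seen 1085 times among 2000 sampled copies.
p_hat, se = frequency_estimate(1085, 2000)
print(f"p_hat = {p_hat:.4f}, standard error = {se:.4f}")
```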
Testing Hardy-Weinberg using chi-square statistics
HW oversimplifies everything, but it can be used as a baseline to test
whether interesting evolution is going on for some allele.
A classical example is the blood group genotypes M/N (Sanger 1975). This
genotype determines the expression of a polysaccharide on red blood cell surfaces, so
it was quantifiable before the genomic era:
Genotype   Observed   HW expected
MM         298        294.3
MN         489        496.4
NN         213        209.3
$P(MM) = p^2, \quad P(MN) = 2pq, \quad P(NN) = q^2$
$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 0.22$
Chi-square significance can be computed from the chi-square
distribution with df degrees of freedom.
Here: df = #classes - #parameters - 1 = 3 (MM/MN/NN) - 1 (p) - 1 = 1
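The test above can be reproduced in a few lines. This is a sketch using only the standard library; the function name is an assumption, and the p-value uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal.

```python
import math

def hw_chi_square(n_mm, n_mn, n_nn):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic locus."""
    n = n_mm + n_mn + n_nn
    p = (2 * n_mm + n_mn) / (2 * n)  # frequency of M, estimated from the data
    q = 1 - p
    expected = [n * p * p, n * 2 * p * q, n * q * q]
    observed = [n_mm, n_mn, n_nn]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function, valid for df = 1 only
    return p, expected, chi2, p_value

p, expected, chi2, p_value = hw_chi_square(298, 489, 213)
print(f"p = {p:.4f}, chi2 = {chi2:.2f}, p-value = {p_value:.2f}")
```

A large p-value here means the data give no reason to reject Hardy-Weinberg proportions.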
Wright-Fisher model for genetic drift
[Diagram: N individuals → ∞ gametes → N individuals → ∞ gametes]
We follow the number of copies f of an allele in the population, until fixation (f=2N) or loss (f=0)
We can model this as a Markov process on a variable X (the number of A alleles)
with transition probabilities:
$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}$
(sampling j A alleles for the next generation from a population of 2N alleles, i of which are A)
In a larger population the frequency changes more slowly (the variance of the sampled
frequency is pq/2N, so sampling does not change it much)
[Diagram: Markov chain on the states 0 (loss), 1, …, 2N-1, 2N (fixation)]
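To make the transition probabilities concrete, here is a minimal sketch that builds one row of the transition matrix and checks its mean and variance. The function name and the population size are assumptions chosen for illustration.

```python
from math import comb

def transition_row(i, two_n):
    """Row i of the Wright-Fisher matrix: P(j copies next generation | i copies now)."""
    p = i / two_n
    return [comb(two_n, j) * p**j * (1 - p) ** (two_n - j) for j in range(two_n + 1)]

two_n = 20  # hypothetical population of 2N = 20 allele copies
i = 5
row = transition_row(i, two_n)
mean_copies = sum(j * t for j, t in enumerate(row))
var_freq = sum((j / two_n - i / two_n) ** 2 * t for j, t in enumerate(row))
print(mean_copies, var_freq)  # mean stays at i; variance of the frequency is pq/2N
```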
Drift and fixation probability
Since 0 and 2N are absorbing states, given sufficient time the Wright-Fisher
process will converge to either 0 or 2N. Define:
$\tau = \min\{ n : X_n = 0 \text{ or } X_n = 2N \}$
Theorem (fixation in drift): In the Wright-Fisher model, the probability of fixation
of the A allele, given a population of 2N alleles of which i are A, is:
$P_i(X_\tau = 2N) = \frac{i}{2N}$
Proof: The mean of a binomial sample of size n is np, so at each step:
$E(X_{n+1} \mid X_n = i) = 2N \cdot \frac{i}{2N} = i = X_n$
which means that the expected number of A's is constant in time. Intuitively:
$i = E_i(X_\tau) = 2N \cdot P_i(X_\tau = 2N)$
More formally, as $n \to \infty$:
$i = E_i(X_n) = E_i(X_\tau; \tau \le n) + E_i(X_n; \tau > n) = E_i(X_\tau) + o(1)$
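The theorem can be checked for a small population by iterating the Markov chain until almost all probability mass is absorbed. This is a sketch under assumed names and an arbitrary small population size, not an efficient implementation.

```python
from math import comb

def fixation_probability(i, two_n, generations=500):
    """Probability mass at state 2N after many generations, starting from i copies."""
    size = two_n + 1
    T = [[comb(two_n, j) * (a / two_n) ** j * (1 - a / two_n) ** (two_n - j)
          for j in range(size)] for a in range(size)]
    dist = [0.0] * size
    dist[i] = 1.0
    for _ in range(generations):
        # One generation: propagate the distribution through the transition matrix.
        dist = [sum(dist[a] * T[a][j] for a in range(size)) for j in range(size)]
    return dist[two_n]

two_n = 10  # hypothetical population of 2N = 10 allele copies
for i in (1, 3, 5):
    print(i, round(fixation_probability(i, two_n), 6), i / two_n)
```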
Drift
Figure 7.4
Experiments with drifting fly populations: 107 Drosophila melanogaster populations. Each
consisted originally of 16 brown eyes (bw) heterozygotes. At each generation, 8 males
and 8 females were selected at random from the progeny of the previous generation.
The bars show the distribution of allele frequencies in the 107 populations
The coalescent
When sampling k new individuals, the chance of picking the same
parent twice is roughly:
$\frac{k(k-1)}{2} \cdot \frac{1}{2N} + O\left(\frac{1}{N^2}\right)$
When looking at k individuals, we can trace their coalescent backwards and
ask when they had k-1, k-2, …, or one common ancestor.
Theorem: The amount of time during which there are k lineages, $t_k$, has
approximately an exponential distribution with mean $2N \cdot \frac{2}{k(k-1)}$
Proof: the probability of not merging k lineages in n generations is:
$\left(1 - \frac{k(k-1)}{2} \cdot \frac{1}{2N}\right)^{n} \approx \exp\left(-\frac{k(k-1)}{2} \cdot \frac{n}{2N}\right)$
which is like an exponential $e^{-\lambda t}$ with rate $\lambda = \frac{k(k-1)}{4N}$.
The expected value is $E(t_k) = \frac{1}{\lambda} = \frac{4N}{k(k-1)}$
This is correct for any k, so going backward from the present time, we
can estimate the time to coalescence at each step:
[Diagram: genealogy of 5 sampled lineages, traced from Present back to Past]
$E(T_5) = \frac{2N}{10}, \quad E(T_4) = \frac{2N}{6}, \quad E(T_3) = \frac{2N}{3}, \quad E(T_2) = 2N$
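The per-generation coalescence probability used in this argument is an approximation. The sketch below compares it with the exact probability that k sampled copies do not all pick distinct parents among 2N; the function names and the numbers are assumptions for illustration.

```python
def prob_shared_parent(k, two_n):
    """Exact probability that k sampled copies do not all have distinct parents."""
    all_distinct = 1.0
    for j in range(1, k):
        all_distinct *= 1 - j / two_n  # the (j+1)-th copy avoids the j parents already used
    return 1 - all_distinct

def pairwise_approximation(k, two_n):
    """Leading-order term: number of pairs times 1/(2N)."""
    return k * (k - 1) / 2 / two_n

two_n = 20000  # hypothetical population of 2N = 20000 allele copies
for k in (2, 5, 10):
    print(k, prob_shared_parent(k, two_n), pairwise_approximation(k, two_n))
```

The two values differ only at order 1/N², which is why the exponential waiting time is a good description when k is much smaller than N.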
The coalescent
The expected time to the common ancestor of n individuals (the time $T_1$ until a single lineage remains):
$E(T_1) = \sum_{k=2..n} \frac{4N}{k(k-1)} = 4N \sum_{k=2..n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 4N\left(1 - \frac{1}{n}\right)$
Theorem: The probability that the most recent common ancestor of a
sample of size n is the same as that of the population converges to (n-1)/(n+1) as the population size increases.
4N is the magic number
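The telescoping sum can be verified exactly with rational arithmetic. This is a small sketch; the function names and the population size are assumptions made for the example.

```python
from fractions import Fraction

def expected_interval(k, n_pop):
    """Expected number of generations during which there are k lineages."""
    return Fraction(4 * n_pop, k * (k - 1))

def expected_tmrca(n_sample, n_pop):
    """Expected time to the common ancestor of a sample: sum of the intervals."""
    return sum(expected_interval(k, n_pop) for k in range(2, n_sample + 1))

N = 10000  # hypothetical population size
for n_sample in (2, 5, 100):
    print(n_sample, float(expected_tmrca(n_sample, N)), 4 * N * (1 - 1 / n_sample))
```

Note that the last interval alone, with two lineages, accounts for 2N of the at most 4N generations.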
Diffusion approximation and Kimura’s solution
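Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat
equation):
$\phi(x, t)$: the density of populations with frequency in x..x+dx at time t
$J(x, t)$: the flux of probability at time t and frequency x
The change in the density equals the difference between the fluxes J(x,t) and
J(x+dx,t); taking dx to the limit we have:
$\frac{\partial \phi(x, t)}{\partial t} = -\frac{\partial J(x, t)}{\partial x}$
If M(x) is the mean change in allele frequency when the frequency is x, and V(x) is
the variance of that change, then the probability flux equals:
$J(x, t) = M(x)\phi(x, t) - \frac{1}{2} \frac{\partial}{\partial x}\left[V(x)\phi(x, t)\right]$
Substituting gives the heat diffusion equation (Fokker-Planck, or Kolmogorov forward equation):
$\frac{\partial \phi(x, t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x, t)\right] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[V(x)\phi(x, t)\right]$
For pure drift, $M = 0$ and $V(x) = \frac{x(1-x)}{2N}$:
$\frac{\partial \phi(x, t)}{\partial t} = \frac{1}{4N} \frac{\partial^2}{\partial x^2}\left[x(1-x)\phi(x, t)\right]$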
Diffusion approximation and Kimura’s solution
Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat
equation). We start by working with a time step dt and a frequency step dx:
$\phi(x, t)$: the probability that the population has allele frequency x at time t
$M(x)$: the probability that the frequency increases from x by dx, due to
mutation/selection
$V(x)/2$: the probability of a dx increase, or of a dx decrease, due to drift
We limit changes to one time step (t to t+dt) and one frequency step (x±dx). The population can be at x at time t+dt if:
It was at x and stayed there: $\phi(x, t)\,(1 - M(x) - V(x))$
It was at x-dx and moved to x: $\phi(x - dx, t)\,(M(x - dx) + V(x - dx)/2)$
It was at x+dx and moved to x: $\phi(x + dx, t)\,V(x + dx)/2$
Summing the three cases:
$\phi(x, t + dt) - \phi(x, t) = -\left[M(x)\phi(x, t) - M(x - dx)\phi(x - dx, t)\right] + \frac{1}{2}\left[V(x + dx)\phi(x + dx, t) - V(x)\phi(x, t)\right] - \frac{1}{2}\left[V(x)\phi(x, t) - V(x - dx)\phi(x - dx, t)\right]$
Taking dt and dx to the limit:
$\frac{\partial \phi(x, t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x, t)\right] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[V(x)\phi(x, t)\right]$
Diffusion approximation and Kimura’s solution
For drift the variance is binomial: $V(x) = x(1-x)/2N$
And we assume no selection: $M(x) = 0$
Still not easy to solve analytically…
Changes in allele frequencies, Wright-Fisher model
After about 4N generations, just 10% of the cases are not fixed and the distribution
becomes flat.
Absorption time and Time to fixation
According to Kimura’s solution, the mean time to fixation of an allele, given an initial frequency
p and given that it was not lost, is:
$\hat{t}_1(p) = -\frac{4N}{p}(1-p)\log(1-p)$
The mean time to allele loss is (the fixation time of the complementary event):
$\hat{t}_0(p) = -\frac{4N}{1-p}\, p \log(p)$
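To get a feel for these formulas, the sketch below evaluates them for a new mutation, which starts at frequency 1/2N. The function names and the population size are assumptions for this example, and log is taken as the natural logarithm.

```python
import math

def mean_time_to_fixation(p, n_pop):
    """Mean generations to fixation, conditional on fixation, from frequency p."""
    return -4 * n_pop * (1 - p) * math.log(1 - p) / p

def mean_time_to_loss(p, n_pop):
    """Mean generations to loss, conditional on loss, from frequency p."""
    return -4 * n_pop * p * math.log(p) / (1 - p)

N = 1000                # hypothetical population size
p_new = 1 / (2 * N)     # a single new mutant copy
t_fix = mean_time_to_fixation(p_new, N)
t_loss = mean_time_to_loss(p_new, N)
print(f"fixation: {t_fix:.1f} generations (4N = {4 * N}), loss: {t_loss:.1f} generations")
```

A new mutation that is destined to fix takes about 4N generations, while one that is lost disappears within a few generations.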
Effective population size
4N generations looks like a huge number (in a population of billions!)
But in fact, the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating (any two individuals can mate)
The effective population size is defined as the size of an idealized population for which
the predicted dynamics of changes in allele frequency are similar to the observed ones
For each measurable statistic of population dynamics, a different effective population size
can be computed
For example, the expected variance in allele frequency is expressed as:
$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N}$
But we can use the same formula to define the effective population size given the variance:
$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N_e}$
Effective population size: changing populations
If the population is changing over time, the dynamics will be affected by the harmonic mean of
the sizes:
$N_e = \frac{t}{\frac{1}{N_0} + \frac{1}{N_1} + \ldots + \frac{1}{N_{t-1}}}$
So the effective population size is dominated by the size of the smallest bottleneck
Bottlenecks can occur during migration, environmental stress, isolation
Such effects greatly decrease heterozygosity (founder effect, for example Tay-Sachs in
“ashkenazim”)
Bottlenecks can accelerate fixation of neutral or even deleterious mutations, as we shall see
later.
Human effective population size over the last 2My is estimated at around 10,000 (due to
bottlenecks). (So when was our T1?)
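A short numerical sketch shows how strongly a single bottleneck pulls the harmonic mean down. The population sizes are hypothetical and the function name is an assumption.

```python
def effective_size(sizes):
    """Harmonic mean of the per-generation population sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical history: nine generations at 10,000 and one bottleneck of 100.
sizes = [10000] * 9 + [100]
ne = effective_size(sizes)
arithmetic_mean = sum(sizes) / len(sizes)
print(f"Ne = {ne:.1f}, arithmetic mean = {arithmetic_mean:.1f}")
```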
Effective population size: unequal sex ratio, and sex
chromosomes
If there are more females than males, or there are fewer males participating in reproduction,
then the effective population size will be smaller than the actual size $N_a = N_m + N_f$:
$N_e = \frac{4 N_m N_f}{N_m + N_f}$
(any combination of alleles from a male and a female)
So if there are 10 times more females than males in the population (x males, 10x females), the effective population size is
4*x*10x/(11x) = 40x/11 ≈ 3.6x, much less than the size of the population (11x).
Another example is the X chromosome, which is present in only one copy in males:
$p = \frac{1}{3} p_m + \frac{2}{3} p_f, \quad Var(p) = \frac{1}{9} \cdot \frac{p_m q_m}{N_m} + \frac{4}{9} \cdot \frac{p_f q_f}{2 N_f}$
With $p = p_m = p_f$:
$Var(p) = pq \left(\frac{1}{9 N_m} + \frac{4}{18 N_f}\right) = \frac{pq}{2 \cdot \frac{9 N_m N_f}{4 N_m + 2 N_f}}$
$N_e = \frac{9 N_m N_f}{4 N_m + 2 N_f}$
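The two formulas are easy to evaluate side by side. The sketch below uses hypothetical numbers of breeding males and females; the function names are assumptions.

```python
def ne_autosomal(n_m, n_f):
    """Effective size for autosomal loci with n_m breeding males and n_f breeding females."""
    return 4 * n_m * n_f / (n_m + n_f)

def ne_x_linked(n_m, n_f):
    """Effective size for X-linked loci (one copy in males, two in females)."""
    return 9 * n_m * n_f / (4 * n_m + 2 * n_f)

# Hypothetical population with ten times more females than males.
print(ne_autosomal(100, 1000), ne_x_linked(100, 1000))
# Equal sex ratio: the autosomal value equals the census size, the X-linked value is 3/4 of it.
print(ne_autosomal(500, 500), ne_x_linked(500, 500))
```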
Recombination and linkage
Assume two loci with alleles A1, A2 and B1, B2, at frequencies p1, p2 and q1, q2
Linkage equilibrium:
$P(A_1B_1) = p_1 q_1$
$P(A_1B_2) = p_1 q_2$
$P(A_2B_1) = p_2 q_1$
$P(A_2B_2) = p_2 q_2$
Only double heterozygotes allow
recombination to change haplotype frequencies:
A1B1 / A2B2 ↔ A1B2 / A2B1
The recombination fraction r: proportion of recombinant gametes generated from a double
heterozygote
For loci on different chromosomes: r = 0.5
For loci on the same chromosome: a function of the distance and possibly other factors
Linkage disequilibrium (LD)
$P_{11} = P(A_1B_1), \quad P_{12} = P(A_1B_2), \quad P_{21} = P(A_2B_1), \quad P_{22} = P(A_2B_2)$
[Diagram: with probability r, recombination on any A1-/-B1; with probability 1-r, no recombination and A1B1 is transmitted intact]
Next generation:
$P_{11}' = (1 - r) P_{11} + r p_1 q_1$
$P_{11}' - p_1 q_1 = (1 - r)(P_{11} - p_1 q_1)$
Define the linkage disequilibrium parameter D as:
$D = P_{11} - p_1 q_1$
$D_n = (1 - r) D_{n-1} = (1 - r)^n D_0$
Equivalently: $D = P_{11} P_{22} - P_{12} P_{21}$
[Plot: D against generation for r=0.05, r=0.2 and r=0.5]
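The geometric decay of D can be checked by iterating the recursion for the A1B1 haplotype frequency directly. This is a sketch with hypothetical starting frequencies; the function names are assumptions.

```python
import math

def iterate_p11(p11, p1, q1, r, generations):
    """Apply P11' = (1 - r) * P11 + r * p1 * q1 for a number of generations."""
    for _ in range(generations):
        p11 = (1 - r) * p11 + r * p1 * q1
    return p11

def generations_to_halve(r):
    """Number of generations for D to drop to half its value."""
    return math.log(0.5) / math.log(1 - r)

# Hypothetical start: p1 = q1 = 0.5 and P11 = 0.4, so D0 = 0.4 - 0.25 = 0.15.
p1, q1, p11_start, r = 0.5, 0.5, 0.4, 0.2
d10 = iterate_p11(p11_start, p1, q1, r, 10) - p1 * q1
print(d10, 0.15 * (1 - r) ** 10)
for rate in (0.05, 0.2, 0.5):
    print(rate, round(generations_to_halve(rate), 2))
```

Allele frequencies p1 and q1 stay fixed throughout, which is why only the haplotype frequency needs to be tracked.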
Linkage disequilibrium (LD) - example
Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium
For M/N: p1 = 0.5425, p2 = 0.4575
For S/s: q1 = 0.3080, q2 = 0.6920
Haplotype   Observed   Expected if unlinked
MS          474        334.2
Ms          611        750.8
NS          142        281.8
Ns          773        633.2
$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 184.7$
Linkage equilibrium highly unlikely!
$D = P_{11} P_{22} - P_{12} P_{21} = 0.07$
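The same numbers can be recomputed from haplotype counts. In the sketch below the counts are an assumption: they are chosen to total 2000 chromosomes and to reproduce the allele frequencies quoted on the slide (p1 = 0.5425, q1 = 0.3080). The function name is also illustrative.

```python
def ld_from_counts(n_ms, n_m_s, n_ns, n_n_s):
    """Expected counts under linkage equilibrium, chi-square, and D from haplotype counts.

    Arguments are counts of the MS, Ms, NS and Ns haplotypes, in that order.
    """
    observed = [n_ms, n_m_s, n_ns, n_n_s]
    n = sum(observed)
    p1 = (n_ms + n_m_s) / n   # frequency of M
    q1 = (n_ms + n_ns) / n    # frequency of S
    expected = [n * p1 * q1, n * p1 * (1 - q1), n * (1 - p1) * q1, n * (1 - p1) * (1 - q1)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p11, p12, p21, p22 = (o / n for o in observed)
    d = p11 * p22 - p12 * p21
    return p1, q1, expected, chi2, d

p1, q1, expected, chi2, d = ld_from_counts(474, 611, 142, 773)
print(f"p1 = {p1:.4f}, q1 = {q1:.4f}, chi2 = {chi2:.1f}, D = {d:.3f}")
```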
Sources of Linkage disequilibrium
LD in original population that was not stabilized due to low r
Genetic coadaptation: regions of the genome that are not subject to
recombination (for example, inverted chromosomal fragments)
Admixture of populations with different allele frequencies:
        Population 1   Population 2   1:1 mixture
P11     0.0025         0.9025         0.4525
P12     0.0475         0.0475         0.0475
P21     0.0475         0.0475         0.0475
P22     0.9025         0.0025         0.4525
D       0              0              0.2025
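The admixture example can be reproduced directly: each source population is in linkage equilibrium, yet an equal mixture of the two is not. This is a minimal sketch; the dictionary keys and function names are assumptions.

```python
def ld(hap):
    """D = P11 * P22 - P12 * P21 for a dict of haplotype frequencies."""
    return hap["P11"] * hap["P22"] - hap["P12"] * hap["P21"]

def mix(h1, h2, w=0.5):
    """Haplotype frequencies after mixing two populations in proportions w and 1 - w."""
    return {key: w * h1[key] + (1 - w) * h2[key] for key in h1}

pop1 = {"P11": 0.0025, "P12": 0.0475, "P21": 0.0475, "P22": 0.9025}
pop2 = {"P11": 0.9025, "P12": 0.0475, "P21": 0.0475, "P22": 0.0025}
mixed = mix(pop1, pop2)
print(ld(pop1), ld(pop2), ld(mixed))
```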
The hapmap project
1 million SNPs (single nucleotide polymorphisms)
4 populations:
30 trios (parents/child) from Nigeria (Yoruba - YRI)
30 trios (parents/child) from Utah (CEU)
45 Han Chinese (Beijing)
44 Japanese (Tokyo)
Haplotyping: each SNP/individual
Not just determining heterozygosity/homozygosity: haplotyping completely resolves the
genotypes (phasing)
Because of linkage, a partial SNP map largely determines all other SNPs!
The idea is that a group of “tag SNPs” can be used to represent all genetic
variation in the human population.
This is extremely important in association studies that look for the genetic causes of
disease.
Correlation on SNPs between populations
Recombination rates in the human population: LD blocks
Recombination rates in the human population
Recombination rates are highly non uniform – with major effects on genome structure!
Mutations
Simplest model: assume two alleles, and mutation probabilities:
$\Pr(A \to a) = \mu$
$\Pr(a \to A) = \nu$
If the process runs long enough, we converge to a stationary distribution:
$\Pr(A) = \frac{\nu}{\mu + \nu}$
As we saw earlier, since the population is finite and undergoes random genetic drift,
any mutation will ultimately be lost or fixed.
Elimination has a significant chance of happening immediately: a new mutation is present in 1 of the 2N copies, and the probability that sampling the next generation misses it every time is:
$\left(1 - \frac{1}{2N}\right)^{2N} \approx 1/e$
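Both statements on this slide can be checked numerically. In the sketch below, the symbols mu (rate of A to a) and nu (rate of a to A) and all numerical values are assumptions chosen for illustration; the rates are far larger than realistic ones so that the iteration converges quickly.

```python
import math

def stationary_frequency(mu, nu, generations=10000, p=1.0):
    """Iterate p' = p * (1 - mu) + (1 - p) * nu, ignoring drift, starting from p."""
    for _ in range(generations):
        p = p * (1 - mu) + (1 - p) * nu
    return p

def prob_immediate_loss(two_n):
    """Probability that a single new copy is missed by all 2N draws of the next generation."""
    return (1 - 1 / two_n) ** two_n

p_stat = stationary_frequency(mu=1e-3, nu=2e-3)
print(p_stat, 2e-3 / (1e-3 + 2e-3))
print(prob_immediate_loss(20000), 1 / math.e)
```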
Infinite alleles model
Adding mutations with probability μ, the coalescent process is extended by killing lineages
(time is sped up by a factor of 2N):
Coalescent: $\frac{k(k-1)}{2} \cdot \frac{1}{2N} \;\to\; \frac{k(k-1)}{2}$
Mutation: $k\mu \;\to\; 2N k \mu = \frac{k\theta}{2}, \quad (\theta = 4N\mu)$
Probability model (Hoppe’s Urn):
Selecting from an urn with one black ball of mass θ and other balls, of other colors,
each of mass 1. Each time the black ball is selected, a new ball with a new color is
added to the urn. If another color is selected, the selected ball and another ball
of the same color are returned to the urn.
Theorem: Hoppe’s Urn and the Coalescent with killing are equivalent
(The Chinese restaurant process)
[Diagram: genealogy with killed lineages, traced back in time]
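Hoppe's urn is straightforward to simulate. The sketch below is one possible implementation under stated assumptions: the black ball has mass theta, colors are labeled by integers in order of appearance, and the function name, seed and parameter values are illustrative.

```python
import random
from collections import Counter

def hoppe_urn(n, theta, rng):
    """Draw n times from Hoppe's urn; return the color of each ball added."""
    colors = []  # colors[i] is the color of the i-th non-black ball in the urn
    next_color = 0
    for i in range(n):
        # The black ball has mass theta; each of the i colored balls has mass 1.
        if rng.random() < theta / (theta + i):
            colors.append(next_color)  # black ball drawn: add a ball with a new color
            next_color += 1
        else:
            colors.append(rng.choice(colors))  # colored ball drawn: add a copy of it
    return colors

rng = random.Random(0)
sample = hoppe_urn(50, 2.0, rng)
print(Counter(sample).most_common())  # allele (color) counts in the sample
```

Each color corresponds to one allele in the sample, so the counts give the allelic partition.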
Testing the infinite alleles model
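Theorem (Ewens sampling formula): Let $a_i$ be the number of alleles present
i times in a sample of size n. When the scaled mutation rate is $\theta = 4N\mu$,
$P(a_1, \ldots, a_n) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{j=1}^{n} \frac{(\theta/j)^{a_j}}{a_j!}$
A simplified statistic is the number of distinct alleles k. This should have the
expected value:
$E(k) = 1 + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \ldots + \frac{\theta}{\theta+n-1}$
Proof: At step i of Hoppe’s process,
we draw the black ball with probability:
$\frac{\theta}{\theta + i - 1}$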
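The expected number of alleles, and the sampling formula in its standard form, can be evaluated as follows. This is a sketch: the function names and the value of theta are assumptions, and the check on a sample of size 3 simply confirms that the probabilities of all allelic partitions sum to one.

```python
from math import factorial

def expected_num_alleles(n, theta):
    """E(k): sum over draws of the probability of picking the black ball."""
    return sum(theta / (theta + i) for i in range(n))

def ewens_probability(a, theta):
    """Ewens probability of a partition; a[j-1] is the number of alleles seen j times."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = 1.0
    for i in range(n):
        rising *= theta + i  # theta * (theta + 1) * ... * (theta + n - 1)
    prob = factorial(n) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= (theta / j) ** a_j / factorial(a_j)
    return prob

theta = 1.5  # hypothetical scaled mutation rate
print(expected_num_alleles(3, theta))
# All partitions of a sample of size 3: three singletons, one singleton plus one pair, one triple.
total = sum(ewens_probability(a, theta) for a in [(3, 0, 0), (1, 1, 0), (0, 0, 1)])
print(total)
```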
Testing the infinite alleles model
Figures 7.16, 7.17
VNTR locus in humans: observed (open columns) and Ewens
predicted allele counts. Not quite neutral.
F computed from the number of Xdh alleles in 89 D.
pseudoobscura lines (52 had a common allele, 8 singletons),
compared to a simulation assuming the infinite alleles
model. Highly non-neutral.