Genome evolution: a sequence


Genome Evolution © Amos Tanay, The Weizmann Institute
Genome evolution
Lecture 3: population genetics I: mutation and recombination
Population genetics
Drift: the process by which allele frequencies change across generations
Mutation: the process by which new alleles are introduced
Recombination: the process by which multi-allelic genomes are mixed
Selection: the effect of fitness on the dynamics of allele drift
Epistasis: the effects of fitness dependencies among different alleles
“Organismal” effects: ecology, geography, behavior
Wright-Fisher model for genetic drift
[Figure: generations of N individuals alternating with pools of ∞ gametes]
We follow the frequency of an allele in the population, until fixation (f=2N) or loss (f=0)
We can model the frequency as a Markov process on a variable X (the number of A alleles)
with transition probabilities:
$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}$
(sampling j alleles from a population of 2N alleles, i of which are A).
In a larger population the frequency changes more slowly: with p = i/2N and q = 1 - p, the variance of the binomial frequency is pq/2N, so one round of sampling changes it less.
[Figure: Markov chain on the states 0 (loss), 1, ..., 2N-1, 2N (fixation)]
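The binomial transition probabilities of the Wright-Fisher chain can be tabulated directly. A minimal Python sketch, assuming only the sampling step described on this slide; the function name `wf_transition_row` and the example values (i = 3, 2N = 10) are illustrative and not part of the lecture:

```python
from math import comb

def wf_transition_row(i, two_n):
    """Return [T_i0, ..., T_i,2N] for a population with i copies of A among 2N."""
    p = i / two_n  # current frequency of allele A
    return [comb(two_n, j) * p**j * (1 - p) ** (two_n - j) for j in range(two_n + 1)]

row = wf_transition_row(3, 10)
# Expected number of A alleles in the next generation (should stay at i).
mean_next = sum(j * t for j, t in enumerate(row))
```

The rows for i = 0 and i = 2N put all their mass on the same state, which is why loss and fixation are absorbing.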
Mutations vs Drift
Diversity (q)= chance of
having same genotype on
two random individuals
Mutations generate population diversity.
Mutation happens at some biologically dependent rate μ (more on that later in the course).
Drift eliminates the population’s diversity through fixation.
Fixation happens on a time scale of ~4N generations.
What will the population look like given both forces?
Stationary distribution when drift is dominating
If mutation is slow compared to drift, we can model the population as a single random variable. Evolution is then a Markov process on two or more states of that variable.
Simplest model: assume two alleles, with mutation probabilities:
$\Pr(A \to a) = \mu$
$\Pr(a \to A) = \nu$
If the process runs long enough, it converges to a stationary distribution:
$\Pr(A) = \frac{\nu}{\mu + \nu}$
[Figure: two states A and a, with transitions A → a at rate μ and a → A at rate ν]
Remember: under these assumptions, a sample is likely to find the entire population in either state A or state a.
Think: what conditions on the mutation rate can justify this model?
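The stationary distribution of the two-allele model can be checked numerically by iterating the chain. A minimal sketch, assuming per-step mutation probabilities μ (A to a) and ν (a to A); the name `stationary_pr_a` and the numeric rates are hypothetical values chosen for illustration:

```python
def stationary_pr_a(mu, nu, steps=100_000):
    """Iterate Pr(A) under A -> a (prob. mu) and a -> A (prob. nu), starting from Pr(A) = 1."""
    pr_a = 1.0
    for _ in range(steps):
        # Stay in A with prob. 1 - mu, or arrive from a with prob. nu.
        pr_a = pr_a * (1 - mu) + (1 - pr_a) * nu
    return pr_a

mu, nu = 1e-3, 3e-3          # illustrative rates, not taken from the lecture
approx = stationary_pr_a(mu, nu)
exact = nu / (mu + nu)       # closed form for the stationary Pr(A)
```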
What happens when mutations are rapid?
If mutation is rapid compared to drift, we lose all population structure
This is just a random mixing process
Evolution cannot work in this way: information must be propagated
In practice, populations maintain a non-trivial balance between mutation and drift
But we do not know the mutation rate (or the effective population size)
A coalescent model approach: Infinite alleles model
When alleles were measured at the protein level, it was reasonable to assume that mutations generate new variants (isozymes), never reverting to or repeating an existing variant.
Adding mutations with probability μ, the coalescent process is extended by killing lineages (time is sped up by a factor of 2N):
Coalescence: $\frac{k(k-1)}{2} \cdot \frac{1}{2N}$ per generation, i.e. $\frac{k(k-1)}{2}$ per unit of rescaled time
Mutation: $k\mu \cdot 2N = k\frac{\theta}{2}, \quad (\theta = 4N\mu)$
[Figure: genealogy traced back in time, the “coalescent with killing”]
Hoppe’s Urn
Probability model (Hoppe’s Urn):
Select from an urn holding one black ball of mass θ and other balls, of various colors, each of mass 1. Each time the black ball is selected, a new ball with a new color is added to the urn. If a ball of another color is selected, the selected ball and another ball of the same color are returned to the urn.
Theorem: Hoppe’s Urn and the coalescent with killing are equivalent
(The Chinese restaurant process)
[Figure: with n colored balls in the urn, a given colored ball is drawn with probability 1/(n+θ) and the black ball with probability θ/(n+θ)]
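The urn scheme is straightforward to simulate. A minimal sketch, assuming the black ball has mass θ and every colored ball has mass 1; the function name `hoppe_urn`, the integer color labels, and the parameters n = 50, θ = 2 are illustrative choices:

```python
import random

def hoppe_urn(n, theta, rng):
    """Run n draws of Hoppe's urn; return the color of each ball added, in order."""
    colors = []       # one entry per colored ball currently in the urn
    next_color = 0
    for _ in range(n):
        total = theta + len(colors)        # black mass plus one unit per colored ball
        if rng.random() * total < theta:
            colors.append(next_color)      # black ball drawn: add a brand-new color
            next_color += 1
        else:
            colors.append(rng.choice(colors))  # colored ball drawn: add a copy of it
    return colors

sample = hoppe_urn(50, theta=2.0, rng=random.Random(0))
num_alleles = len(set(sample))  # number of distinct colors, i.e. distinct alleles
```

The first draw always hits the black ball, because the urn holds nothing else at that point.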
Testing the infinite alleles model
Theorem (Ewens sampling formula): Let a_i be the number of alleles present i times in a sample of size n. When the scaled mutation rate is θ = 4Nμ,
$\Pr(a_1, \ldots, a_n) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{j=1}^{n} \frac{(\theta/j)^{a_j}}{a_j!}$
A simpler statistic is the number of distinct alleles k. It should have the expected value:
$E(k) = 1 + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \cdots + \frac{\theta}{\theta+n-1}$
Proof: At step i of Hoppe’s process, we draw the black ball with probability:
$\frac{\theta}{\theta+i-1}$
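The expected number of distinct alleles is a short sum over the steps of Hoppe's process. A minimal sketch, assuming θ > 0; the function name `expected_num_alleles` is illustrative:

```python
def expected_num_alleles(n, theta):
    """E(k) = sum over steps i = 1..n of theta / (theta + i - 1)."""
    return sum(theta / (theta + i - 1) for i in range(1, n + 1))

# With theta = 1 and n = 3 the sum is 1 + 1/2 + 1/3.
example = expected_num_alleles(3, 1.0)
```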
Testing the infinite alleles model
Figure 7.16, 7.17
Not quite neutral: VNTR locus in humans, observed (open columns) and Ewens predicted allele counts.
Highly non-neutral: F computed from the number of Xdh gene alleles in 89 D. pseudoobscura lines (52 had a common allele, 8 singletons), compared to a simulation assuming the infinite alleles model.
Infinite sites model
In the infinite sites model, mutations occur at distinct sites, exactly once.
This model is appropriate for long DNA sequences
Theorem: Let μ be the mutation rate for a locus under consideration, and set θ = 4Nμ. Under the infinite sites model, the expected number of segregating sites is:
$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$
Proof: Let t_j be the amount of time in the coalescent during which there are j lineages. We showed earlier that t_j has approximately an exponential distribution with mean 2/(j(j-1)). The total amount of time in the tree for a sample of size n is:
$T_{tot} = \sum_{j=2}^{n} j t_j$
$E(T_{tot}) = \sum_{j=2}^{n} j \frac{2}{j(j-1)} = 2 \sum_{j=2}^{n} \frac{1}{j-1}$
Mutations occur at rate 2Nμ:
$E(S_n) = 2N\mu E(T_{tot}) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$
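The two steps of the proof can be mirrored numerically: first the expected total tree length, then the mutation rate 2Nμ = θ/2 applied along it. A minimal sketch; the function names are illustrative:

```python
def harmonic(n):
    """h_n = sum_{i=1}^{n-1} 1/i."""
    return sum(1.0 / i for i in range(1, n))

def expected_total_time(n):
    """E(T_tot) = sum_{j=2}^{n} j * 2 / (j (j - 1)), in units of 2N generations."""
    return sum(j * 2.0 / (j * (j - 1)) for j in range(2, n + 1))

def expected_segregating_sites(n, theta):
    """E(S_n) = (theta / 2) * E(T_tot), using 2N*mu = theta / 2."""
    return (theta / 2.0) * expected_total_time(n)
```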
Infinite sites model
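Theorem: Let θ = 4Nμ. Under the infinite sites model, the number of segregating sites S_n has variance
$V(S_n) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$
Proof: Let s_j be the number of segregating sites created when there were j lineages. While there are j lineages, we may get mutations at rate 2Nμj, and coalescence at rate j(j-1)/2. A mutation occurs before coalescence with probability:
$\frac{2N\mu j}{2N\mu j + j(j-1)/2} = \frac{4N\mu}{4N\mu + j - 1} = \frac{\theta}{\theta + j - 1}$
The probability of k successes (mutations) before coalescence is a shifted geometric distribution:
$\Pr(s_j = k) = \left(\frac{\theta}{\theta+j-1}\right)^{k} \frac{j-1}{\theta+j-1}, \quad k = 0, 1, 2, \ldots$
With $p = \frac{j-1}{\theta+j-1}$:
$Var(s_j) = \frac{1-p}{p^2} = \frac{\theta}{\theta+j-1} \cdot \frac{(\theta+j-1)^2}{(j-1)^2} = \frac{\theta^2 + (j-1)\theta}{(j-1)^2} = \frac{\theta}{j-1} + \frac{\theta^2}{(j-1)^2}$
Summing over j = 2, ..., n (the s_j are independent) gives the theorem.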
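The per-level variances from the proof should add up to the closed form in the theorem. A minimal sketch that checks this numerically; the function names and the test values (n = 10, θ = 3) are illustrative:

```python
def var_s_level(j, theta):
    """Var(s_j) of the shifted geometric with stopping probability p = (j-1)/(theta+j-1)."""
    p = (j - 1) / (theta + j - 1)
    return (1 - p) / p**2

def var_segregating_sites(n, theta):
    """V(S_n) = theta * sum 1/i + theta^2 * sum 1/i^2, for i = 1..n-1."""
    h_n = sum(1.0 / i for i in range(1, n))
    g_n = sum(1.0 / i**2 for i in range(1, n))
    return theta * h_n + theta**2 * g_n

# Sum the level-by-level variances for j = 2..10 lineages.
summed = sum(var_s_level(j, 3.0) for j in range(2, 11))
```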
Watterson’s estimator, using the infinite sites model
We can estimate θ = 4Nμ from an empirical S_n, since $E(S_n) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$.
Theorem: For Watterson’s estimator $\theta_W = \frac{S_n}{h_n}$, where $h_n = \sum_{i=1}^{n-1} \frac{1}{i}$ and $g_n = \sum_{i=1}^{n-1} \frac{1}{i^2}$:
$E(\theta_W) = \theta$
$V(\theta_W) = \theta \frac{1}{h_n} + \theta^2 \frac{g_n}{h_n^2}$
So we can build a model of the population from as little data as S
What will happen if we want to incorporate more complex models (e.g., expansion, migration)?
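Watterson's estimator is a one-line computation once the number of segregating sites is known. A minimal sketch; the function name `watterson_theta` and the sample values (11 segregating sites among 4 sequences) are hypothetical and only illustrate the arithmetic:

```python
def watterson_theta(num_segregating, n):
    """theta_W = S_n / h_n with h_n = sum_{i=1}^{n-1} 1/i."""
    h_n = sum(1.0 / i for i in range(1, n))
    return num_segregating / h_n

# Hypothetical data: S_n = 11 in a sample of n = 4, so h_4 = 11/6.
estimate = watterson_theta(11, 4)
```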
Finite alleles model
If we think of a single DNA base, we only have 4 possible alleles
Our model must then include recurrent mutations
[Figure: substitution graph among the four bases A, G, T, C]
Even if we assume neutrality, our mutations can become dependent
-We may have different rates at different sites
-We may have coupling of one base and the bases nearby
We may need to consider insertions and deletions
Importantly, if all these are neutral, then the basic coalescent structure is not affected
The Poisson process:
$\Pr(m = j) = \frac{(\lambda t)^j}{j!} e^{-\lambda t}$
Expected = $\lambda t$
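The Poisson probabilities for the number of mutations on a branch can be evaluated directly. A minimal sketch, assuming a rate called `lam` per unit time and a branch of length `t` (both names and the values 2.0 and 1.5 are illustrative):

```python
from math import exp, factorial

def poisson_pmf(j, lam, t):
    """Probability of exactly j events in time t for a Poisson process of rate lam."""
    return (lam * t) ** j * exp(-lam * t) / factorial(j)

lam, t = 2.0, 1.5
probs = [poisson_pmf(j, lam, t) for j in range(60)]   # tail beyond 60 is negligible here
mean_events = sum(j * p for j, p in enumerate(probs)) # should be close to lam * t
```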
Using simulations
The sampling procedure:
Generate a large number of populations (using the model we presented)
Compute the distribution of your statistic over these random cases
Compare it to the value you observe in your population
If you find a significant bias, some modeling assumption must be wrong
In principle, we can sample generation after generation, for sufficient time (how much?)
Direct simulation using Wright-Fisher is painfully expensive (why?)
If you are only interested in the current population, most of your coin tossing will be useless
We can use the coalescent approach and just sample genealogies, going back in time
For example, using the coalescent with killing
Important: this is analogous to first sampling a tree and then scattering the mutations on it
We can also think of simulating evolution while ignoring the population, based on the Markov process shown above (what are the limitations here?)
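One way to sketch the coalescent shortcut is to simulate the number of segregating sites without ever building a population. The sketch below assumes the infinite sites model and uses the competing-events view from the variance proof: while j lineages remain, the next event is a mutation with probability θ/(θ+j-1) and a coalescence otherwise. The function name, seed, and parameter values are illustrative.

```python
import random

def simulate_segregating_sites(n, theta, rng):
    """Draw S_n for a sample of size n by walking back from n lineages to 1."""
    s = 0
    for j in range(n, 1, -1):
        p_mut = theta / (theta + j - 1)   # mutation happens before coalescence
        while rng.random() < p_mut:
            s += 1                         # each mutation adds a new segregating site
    return s

rng = random.Random(1)
draws = [simulate_segregating_sites(10, 5.0, rng) for _ in range(20000)]
mean_s = sum(draws) / len(draws)
expected_s = 5.0 * sum(1.0 / i for i in range(1, 10))  # theta * h_n
```

Comparing an observed S against the simulated `draws` is the kind of test the sampling procedure above describes.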
Recombination and linkage
Assume two loci have alleles A1, A2 and B1, B2
Linkage equilibrium:
$P(A_1B_1) = p_1q_1$
$P(A_1B_2) = p_1q_2$
$P(A_2B_1) = p_2q_1$
$P(A_2B_2) = p_2q_2$
Only double heterozygotes allow recombination to change gamete (haplotype) frequencies:
[Figure: the double heterozygotes A1B1/A2B2 and A1B2/A2B1, with the gametes A1B1, A2B2, A1B2, A2B1]
The recombination fraction r: the proportion of recombinant gametes generated from double heterozygotes
For different chromosomes: r = 0.5
For the same chromosome: a function of the distance and possibly other factors
Linkage disequilibrium (LD)
$P_{11} = P(A_1B_1), \; P_{12} = P(A_1B_2), \; P_{21} = P(A_2B_1), \; P_{22} = P(A_2B_2)$
[Figure: with probability r, recombination joins A1 from any A1- chromosome with B1 from any -B1 chromosome; with probability 1-r, there is no recombination and A1B1 is transmitted intact]
Next generation:
$P_{11}' = (1-r)P_{11} + r p_1 q_1$
$P_{11}' - p_1 q_1 = (1-r)(P_{11} - p_1 q_1)$
Define the linkage disequilibrium parameter D as:
$D = P_{11} - p_1 q_1$
$D_n = (1-r) D_{n-1} = (1-r)^n D_0$
$D = P_{11}P_{22} - P_{12}P_{21}$
[Figure: D as a function of generation, for r = 0.05, r = 0.2, r = 0.5]
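The geometric decay of D can be checked against the one-generation recursion for P11. A minimal sketch; the function names and the starting values (P11 = 0.5, p1 = q1 = 0.5, r = 0.2) are illustrative:

```python
def iterate_p11(p11, p1, q1, r, generations):
    """Apply P11' = (1 - r) * P11 + r * p1 * q1 for the given number of generations."""
    for _ in range(generations):
        p11 = (1 - r) * p11 + r * p1 * q1
    return p11

def ld_after(d0, r, generations):
    """Closed form D_n = (1 - r)^n * D_0."""
    return (1 - r) ** generations * d0

d0 = 0.5 - 0.5 * 0.5                                    # D_0 = P11 - p1 * q1
d10_iterated = iterate_p11(0.5, 0.5, 0.5, 0.2, 10) - 0.5 * 0.5
d10_closed = ld_after(d0, 0.2, 10)
```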
Linkage disequilibrium (LD) - example
Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium.
For M/N: p1 = 0.5425, p2 = 0.4575
For S/s: q1 = 0.3080, q2 = 0.6920

Haplotype   Observed   Expected if unlinked
MS          474        334.2
Ms          611        750.8
NS          142        281.8
Ns          773        633.2

$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 184.7$
Linkage equilibrium highly unlikely!
$D = P_{11}P_{22} - P_{12}P_{21} = 0.07$
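The numbers on this slide can be recomputed from the haplotype counts. The sketch below assumes an MS count of 474, which is the value consistent with the stated frequencies p1 = 0.5425 and q1 = 0.3080, a total of 2000 chromosomes, and the reported χ² of 184.7; variable names are illustrative.

```python
# Haplotype counts; the MS count of 474 is an assumption (see the note above).
observed = {"MS": 474, "Ms": 611, "NS": 142, "Ns": 773}
total = sum(observed.values())

p1 = (observed["MS"] + observed["Ms"]) / total   # frequency of M
q1 = (observed["MS"] + observed["NS"]) / total   # frequency of S

# Expected counts under linkage equilibrium (products of allele frequencies).
expected = {
    "MS": total * p1 * q1,
    "Ms": total * p1 * (1 - q1),
    "NS": total * (1 - p1) * q1,
    "Ns": total * (1 - p1) * (1 - q1),
}
chi2 = sum((observed[h] - expected[h]) ** 2 / expected[h] for h in observed)
d = observed["MS"] / total - p1 * q1             # D = P11 - p1 * q1
```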
Sources of Linkage disequilibrium
LD in the original population that has not yet decayed because r is low
Genetic coadaptation: regions of the genome that are not subject to recombination (for example, inverted chromosomal fragments)
Admixture of populations with different allele frequencies:

        Population 1   Population 2   Mixed population
P11     0.0025         0.9025         0.4525
P12     0.0475         0.0475         0.0475
P21     0.0475         0.0475         0.0475
P22     0.9025         0.0025         0.4525
D       0              0              0.2025
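The admixture example can be reproduced in a few lines. The sketch assumes, as the slide's numbers suggest, that each source population is at linkage equilibrium with allele frequencies 0.05 and 0.95 respectively, and that the two are mixed in equal proportions; these values and the function names are inferred for illustration.

```python
def haplotype_freqs(p, q):
    """(P11, P12, P21, P22) at linkage equilibrium for allele frequencies p and q."""
    return (p * q, p * (1 - q), (1 - p) * q, (1 - p) * (1 - q))

def ld(freqs):
    """D = P11 * P22 - P12 * P21."""
    p11, p12, p21, p22 = freqs
    return p11 * p22 - p12 * p21

pop1 = haplotype_freqs(0.05, 0.05)
pop2 = haplotype_freqs(0.95, 0.95)
# Equal-proportion mixture of the two populations (assumed 50/50).
mixed = tuple((a + b) / 2 for a, b in zip(pop1, pop2))
```

Neither source population shows LD, yet the mixture does, because the allele frequencies differ between the sources at both loci.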
Recombination rates in the human population: LD blocks
Recombination rates in the human population
Recombination rates are highly non-uniform, with major effects on genome structure!
Selection
Fitness: the relative reproductive success of an individual (or genome)
Fitness is only defined with respect to the current population.
Fitness is unlikely to remain constant in all conditions and environments
[Figure: the sampling probability is multiplied by a selection factor 1+s]
Mutations can change fitness
A deleterious mutation decreases fitness. It would therefore be selected against. This process is called negative or purifying selection.
An advantageous or beneficial mutation increases fitness. It would therefore be subject to positive selection.
A neutral mutation is one that does not change fitness.
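The effect of the selection factor on sampling can be written as a weighted frequency. This is a sketch under one assumption that the slide does not spell out: after multiplying the weight of allele A by 1+s, the weights are renormalized so the two probabilities sum to 1. The function name is illustrative.

```python
def selected_sampling_prob(i, two_n, s):
    """Probability of sampling allele A when i of 2N alleles are A and A has weight 1 + s."""
    weight_a = i * (1 + s)       # A alleles, each boosted by the selection factor
    weight_other = two_n - i     # remaining alleles keep weight 1
    return weight_a / (weight_a + weight_other)

neutral = selected_sampling_prob(5, 10, 0.0)     # s = 0 recovers the neutral frequency
favored = selected_sampling_prob(5, 10, 0.1)     # s > 0 raises the chance of sampling A
```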
The Moran model
Instead of working with discrete generations, we replace at most one individual at each time step
[Figure: a population of A and a alleles at successive time steps Δt; one individual (X) is removed and replaced by sampling from the current population; Δt → 0]
We assume time steps are small. What kind of mathematical model describes the process?
Continuous time Markov processes
$P(x, s; t, A) = \Pr(X_t \in A \mid X_s = x), \quad t \in [0, \infty)$
Conditions on transitions:
$P_{ij}(t) \ge 0$
$\sum_j P_{ij}(t) = 1$
$\sum_k P_{ik}(t) P_{kj}(h) = P_{ij}(t+h), \quad t, h \ge 0$
$\lim_{t \to 0} P_{ij}(t) = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}$
Theorem:
$-P_{ii}'(0) = \lim_{t \to 0} \frac{1 - P_{ii}(t)}{t} = q_{ii}$ exists (may be infinite)
$P_{ij}'(0) = \lim_{t \to 0} \frac{P_{ij}(t)}{t} = q_{ij}$ exists and is finite (i ≠ j)
[Figure labels: Markov, Kolmogorov]
Rates and transition probabilities
The process’s rate matrix:
$Q = \begin{pmatrix} -\sum_{i \ne 0} q_{0i} & q_{01} & q_{02} & \cdots & q_{0n} \\ q_{10} & -\sum_{i \ne 1} q_{1i} & q_{12} & \cdots & q_{1n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{n0} & q_{n1} & q_{n2} & \cdots & -\sum_{i \ne n} q_{ni} \end{pmatrix}$
Transitions differential equations (backward form):
$P_{ij}(s+t) - P_{ij}(t) = \sum_k P_{ik}(s) P_{kj}(t) - P_{ij}(t)$
$= \sum_{k \ne i} P_{ik}(s) P_{kj}(t) + [P_{ii}(s) - 1] P_{ij}(t)$
$s \to 0 \Rightarrow P_{ij}'(t) = \sum_{k \ne i} q_{ik} P_{kj}(t) - q_{ii} P_{ij}(t)$
$P'(t) = Q P(t) \Rightarrow P(t) = \exp(Qt)$
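The relation P(t) = exp(Qt) can be checked on a small example. The sketch below uses a truncated Taylor series for the matrix exponential, which is adequate only when the entries of Qt are small, and compares it with the known closed form for a two-state chain. The rates a and b, the time t, and the function names are illustrative.

```python
from math import exp

def mat_mul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def expm_series(q, t, terms=60):
    """Approximate exp(Qt) by the first `terms` terms of its Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term_k = term_{k-1} * (Qt) / k
        term = [[v / k for v in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

a, b, t = 0.3, 0.1, 2.0                  # rates 0 -> 1 and 1 -> 0, and elapsed time
P = expm_series([[-a, a], [b, -b]], t)
closed_form_p00 = b / (a + b) + a / (a + b) * exp(-(a + b) * t)
```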
The Moran model
Assume the rate of replacement for each individual is 1. We derive a model similar to Wright-Fisher, but in continuous time: a process on a random variable counting the number of A alleles.
[Figure: birth-death chain on the states 0 (loss), 1, ..., i-1, i, i+1, ..., 2N-1, 2N (fixation)]
Rates:
“Birth”, i → i+1: $b_i = (2N - i) \cdot \frac{i}{2N}$
“Death”, i → i-1: $d_i = i \cdot \frac{2N - i}{2N}$
Fixation probability
In fact, in the limit, the Moran model converges to the Wright-Fisher model. For example:
Theorem: When going backward in time, the Moran model generates the same distribution of genealogies as Wright-Fisher, only that time runs twice as fast
Theorem: In the Moran model, the probability that A becomes fixed when there are initially i copies is i/2N
Proof: like the proof for the Wright-Fisher model. The expected value of X is unchanged, since the rates of births and deaths are the same
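The fixation probability can also be computed exactly from the birth and death rates, using the standard hitting-probability formula for birth-death chains rather than the martingale argument of the proof. A minimal sketch with exact rational arithmetic; the function name is illustrative:

```python
from fractions import Fraction

def moran_fixation_prob(i, two_n):
    """Probability of reaching 2N before 0, starting from i copies of A."""
    def birth(k):
        return Fraction((two_n - k) * k, two_n)   # b_k = (2N - k) * k / 2N

    def death(k):
        return Fraction(k * (two_n - k), two_n)   # d_k = k * (2N - k) / 2N

    # rho[k] = product over m = 1..k of d_m / b_m, with rho[0] = 1.
    rho = [Fraction(1)]
    for k in range(1, two_n):
        rho.append(rho[-1] * death(k) / birth(k))
    return sum(rho[:i]) / sum(rho)

prob = moran_fixation_prob(3, 10)
```

Because b_k = d_k for every k, all the ratios equal 1 and the formula collapses to i/2N.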
Fixation time
Expected fixation time assuming fixation:
$\bar{E}_i \tau = E_i(\tau \mid T_{2N} < T_0)$
Theorem: In the Moran model, let p = i/2N, then:
$\bar{E}_i \tau \approx -\frac{2N(1-p)}{p} \log(1-p)$
Proof: not here.
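The approximation for the conditional fixation time is easy to evaluate. A minimal sketch, assuming 0 < p < 1 and time measured in the Moran model's own units; the function name and the example values are illustrative:

```python
from math import log

def moran_fixation_time(i, two_n):
    """Approximate expected time to fixation, given fixation, from i copies among 2N."""
    p = i / two_n
    return -two_n * (1 - p) / p * log(1 - p)

from_single_copy = moran_fixation_time(1, 1000)   # close to 2N for a new mutation
from_half = moran_fixation_time(50, 100)          # equals 2N * log(2) at p = 1/2
```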