L14 - Computer Science and Engineering

Download Report

Transcript L14 - Computer Science and Engineering

CSE280b: Population Genetics
Vineet Bafna/Pavel Pevzner
March 2006
www.cse.ucsd.edu/classes/sp05/cse291
Vineet Bafna
Review
•
•
Hardy Weinberg Equilibrium
Linkage Equlibrium
March 2006
Vineet Bafna
Recombination and Linkage Equilibrium
0
0
0
0
1
1
0
•
•
•
1
1
1
1
0
0
0
1
1
In a freely mixing population, an individual chromosome
randomly chooses its parent from the available pool
With unimpeded recombination (Linkage Equilibrium), the
individual freely chooses its two parent chromsomes, and then
freely chooses alleles from the parents
What is the probability of seeing the allele <1 1>?
March 2006
Vineet Bafna
Measures of LD
•
•
Consider two bi-allelic sites with alleles
marked with 0 and 1
Define
–
–
•
•
P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
P0* = Pr[Allele 0 in locus 1]
Linkage equilibrium if P00 = P0* P*0
D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …
March 2006
Vineet Bafna
LD over time
•
With random mating, and fixed recombination rate r
between the sites, Linkage Disequilibrium will
disappear over time
–Let
D(t) = LD at time t
–P(t)00 =
(1-r) P(t-1)00 + r P(t-1)0* P(t-1)*0
–D(t) = P(t)00 -
–D(t) =(1-r)
March 2006
P(t)0* P(t)*0 = P(t)00 - P(t-1)0* P(t-1)*0
D(t-1) =(1-r)t D(0)
Vineet Bafna
(HW)
LD over distance
•
Assumption
–
•
Recombination rate increases linearly with distance
Let r be the (constant) recombination rate per bp per
generation
D(t) = P(t)00- P0* P*0
= (1-r)d P(t-1)00+ [1- (1-r)d ] P0* P*0 - P0* P*0
= (1-r)d D(t-1)
•The
assumption is reasonable, but recombination rates
vary from region to region, adding to complexity
March 2006
Vineet Bafna
LD and disease mapping
•
•
•
Consider a mutation that is causal for a disease.
The goal of disease gene mapping is to discover
which gene (locus) carries the mutation.
Consider every polymorphism, and check:
–
–
•
There might be too many polymorphisms
Multiple mutations (even at a single locus) that lead to the
same disease
Instead, consider a dense sample of polymorphisms
that span the genome
March 2006
Vineet Bafna
LD can be used to map disease genes
LD
0
1
1
0
0
1
D
N
N
D
D
N
•
•
LD decays with distance from the disease allele.
By plotting LD, one can short list the region
containing the disease gene.
March 2006
Vineet Bafna
LD and disease gene mapping problems
•
•
•
Marker density?
Complex diseases
Population sub-structure
March 2006
Vineet Bafna
Population Genetics
•
•
•
•
Often we look at these equilibria
(Linkage/HW) and their deviations in specific
populations
These deviations offer insight into evolution.
However, what is Normal?
A combination of empirical (simulation) and
theoretical insight helps distinguish between
expected and unexpected.
March 2006
Vineet Bafna
Topic 2: Simulating population data
•
•
We described various population genetic concepts
(HW, LD), and their applicability
The values of these parameters depend critically upon
the population assumptions.
–
–
–
–
–
•
What if we do not have infinite populations
No random mating (Ex: geographic isolation)
Sudden growth
Bottlenecks
Ad-mixture
It would be nice to have a simulation of such a
population to test various ideas. How would you do
this simulation?
March 2006
Vineet Bafna
Wright Fisher Model of Evolution
•
•
Fixed population size from generation to generation
Random mating
March 2006
Vineet Bafna
Coalescent model
•
Insight 1:
–
–
–
Separate the genealogy from allelic states (mutations)
First generate the genealogy (who begat whom)
Assign an allelic state (0) to the ancestor. Drop mutations on the branches.
March 2006
Vineet Bafna
Coalescent model
•
–
–
–
–
•
0
0
0
0
0
Insight 1:
Assign an allelic state (0) to
the ancestor. Drop mutations
on the branches.
The mutations are proportional
to the branch length.
Each site (locus) mutates at
most once.
At the end, drop any fixed sites
Loci
How efficient is this? How
many generations do we need
to simulate today’s population?
0
1
1
0
1
0
0
1
1
1
0
1
1
0
1
0
0
0
0
1
Individuals
March 2006
Vineet Bafna
Loci
Coalescent theory
•
Insight 2:
–
–
Much of the genealogy is irrelevant, because it
disappears.
Better to go backwards
March 2006
Vineet Bafna
Coalescent
•
•
•
•
•
Note that in the Wright Fischer
model, the population is freely
mixing, and constant size N
One way to think about it is that
each individual in the current
generation selects a parent
uniformly at random from the N
individuals in the previous
generation.
When two individuals choose the
same parent, they coalesce.
Once they coalesce, they stay
together. We continue until only one
individual is left.
Note that this only gives a random
topology with labeled leaves. The
only thing of interest is branch length
(number of generations to MRCA)
March 2006
Vineet Bafna
1
2
3
4
Coalescent theory (Kingman)
•
Input
–
•
(Fixed population (N individuals), random mating)
Consider 2 individuals.
–
Probability that they coalesce in the previous
generation (have the same parent)=
1
N
•

Probability that they do not coalesce
after t
generations= 1 1 t  e t N

March 2006
N

Vineet Bafna
Coalescent theory
•
Consider k individuals.
–
Probability that no pair coalesces after 1
generation
–
Probability that no pair coalesces after t
generations
 k t

k 
t


2 
 

 2
k2 
N
 
1

e

e


N




March 2006
Vineet Bafna
 is time in units
of N generations
Coalescent approximation
•
Insight 3:
–
–
Topology is independent of coalescent times
If you have n individuals, generate a random binary topology
•
Iterate (until one individual)
–
•
Pick a pair at random, and coalesce
Insight 4:
–
To generate coalescent times, there is no need to go back
generation by generation. Generate n random variables to
get the n coalescence times.
March 2006
Vineet Bafna
Coalescent approximation
•
•
At any step, there are 1 <= k <= n individuals
To generate time to coalesce (k to k-1 individuals)
–
–
Pick a number from exponential distribution with rate k(k-1)/2
Mean time to coalescence (in units of N generations)
= 2/(k(k-1))
March 2006
Vineet Bafna
Mean time to coalesce
•
If there are k individuals, the Probability for a
coalescence in one generation is
–
•
k(k-1)/2N
Expected time to coalesce = 2N/k(k-1)
k 
 
2
p   
N
E(k)  p  2qp  3q 2 p
Note that 1 + q  q 2  q 3 
Differentiating,
March 2006
E(k)  1
p
 11 q
1 + 2q  3q 2  4q 3 
Vineet Bafna
 1
p2
Typical coalescents
•
•
4 random examples with n=6 (Note that we
do not need to specify N. Why?)
Expected time to coalesce to 1 node?
March 2006
Vineet Bafna
Coalescent properties
•
Expected time for the last step
=1
•
•
•
The last step is half of the total time to coalesce
Studying larger number of individuals does not change numbers
tremendously
EX: Number of mutations in a population is proportional to the
total branch length of the tree
–
E(Ttot)
March 2006
Vineet Bafna
Variants (exponentially growing populations)
•
•
If the population is growing
exponentially, the branch
lengths become similar, or
even star-like. Why?
With appropriate scaling of
time, the same process
can be extended to
various scenarios: malefemale, hermaphrodite,
segregation, migration,
etc.
March 2006
Vineet Bafna
Simulating population data
•
•
•
Generate a coalescent (Topology + Branch lengths)
For each branch length, drop mutations with rate 
Generate sequence data
•
•
•
Note that the resulting sequence is a perfect phylogeny.
Given such sequence data, can you reconstruct the coalescent
tree? (Only the topology, not the branch lengths)
Also, note that all pairs of positions are correlated (should have
high LD).
March 2006
Vineet Bafna
Coalescent with Recombination
•
An individual
may have one
parent, or 2
parents
March 2006
Vineet Bafna
ARG: Coalescent with recombination
•
•
•
•
Given: mutation rate , recombination rate
, population size 2N (diploid), sample size
n.
How can you generate the ARG
(topology+branch lengths) efficiently?
How will you generate sequences for n
individuals?
Given sequence data, can you reconstruct
the ARG (topology)
March 2006
Vineet Bafna
Recombination
•
Define r as the probability of recombining
per generation.
–
•
Note that the parameter is a value
which will be defined later
Assume k individuals in a generation.
The following might happen:
1.
2.
3.
4.
An individual arises because of a
recombination event between two
individuals (It will have 2 parents).
Two individuals coalesce.
Neither (Each individual has a distinct
parent).
Multiple events (low probability).
March 2006
Vineet Bafna
Recombination
•
•
•
•
•
We ignore the case of multiple (> 1) events in one
generation
Pr (No recombination) = 1-kr
 k 
Pr (No coalescence)
 
 2
1

 2N 


Consider scaled time in units of 2N generations. Thus
the number of individuals increase with rate kr2N, and
decrease with rate  k
2
The value 2rN is usually small, and therefore, the
process will ultimately coalesce to a single individual

(MRCA)
March 2006
Vineet Bafna
ARG
•
•
•
Let k = n,
Define
  4rN
Iterate until k= 1
–
 –
What is the flaw in
this procedure?
Choose time from an exponential distribution with rate
k k 
 2
2  
Pick event as recombination with probability

  (k 1)
–
–
If event
 is recombination, choose an individual to recombine, and a
position, else choose a pair to coalesce.
Update k, and continue

March 2006
Vineet Bafna
Ancestral Recombination Graph
March 2006
Vineet Bafna
Simulating sequences on the ARG
•
•
•
Generate topology and branch lengths as before
For each recombination, generate a position.
Next generate mutations at random on branch
lengths
–
For a mutation, select a position as well.
March 2006
Vineet Bafna
Review: Coalescent theory applications
•
Coalescent simulations allow us to test various hypothesis. The
March 2006
Vineet Bafna
coalescent/ARG is usually not
inferred, unlike in phylogenies.
Coalescent theory: example
•
Ex: ~1400bp at Sod locus in Dros.
–
–
–
10 taxa
5 were identical. The other 5 had 55 mutations.
Q: Is this a chance event, or is there selection for
this haplotype.
March 2006
Vineet Bafna
Coalescent application
–
–
–
–
–
10000 coalescent
simulations were performed
on 10 taxa.
55 mutations on the
coalescent branches
Count the number of times 5
lineages are identical
The event happened in 1.1%
of the cases.
Conclusion: selection, or
some other mechanism
explains this data.
March 2006
Vineet Bafna
Coalescent example: Out of Africa
hypothesis
•
•
Looking at lineage specific mutations might help discard the
candelabra model. How?
How do we decide between the multi-regional and Out-of-Africa
model? How do we decide if the ancestor was African?
March 2006
Vineet Bafna
Human Samples
•
•
We look at data from human samples
Gabriel et al. Science 2002.
–
3 populations were sampled at multiple regions spanning the
genome
•
•
•
•
•
•
March 2006
54 regions (Average size 250Kb)
SNP density 1 over 2Kb
90 Individuals from Nigeria (Yoruban)
93 Europeans
42 Asian
50 African American
Vineet Bafna
Population specific recombination
•
•
D’ was used as the
measure between SNP
pairs.
SNP pairs were classified
in one of the following
–
–
–
•
Strong LD
Strong evidence for
recombination
Others (13% of cases)
This roughly favors out-ofafrica. A Coalescent
simulation can help give
confidence values on this.
March 2006
Gabriel et al., Science 2002
Vineet Bafna
Recombination events and 
•
•
•
Given , n, can you compute the expected
number of recombination events?
It can be shown that E(n, ) =  log (n)
Questions that people are interested in
•
•
Given a set of sequences from a population,
compute the recombination rate 
Given a population reconstruct the most likely
history (as an ancestral recombination graph)
March 2006
Vineet Bafna
Re-constructing history without the
coalescent
March 2006
Vineet Bafna
An algorithm for constructing a perfect phylogeny
•
•
We will consider the case where 0 is the
ancestral state, and 1 is the mutated state. This
will be fixed later.
In any tree, each node (except the root) has a
single parent.
–
•
•
It is sufficient to construct a parent for every node.
In each step, we add a column and refine some
of the nodes containing multiple children.
Stop if all columns have been considered.
March 2006
Vineet Bafna
Inclusion Property
•
•
For any pair of columns i,j
– i < j if and only if i1  j1
Note that if i<j then the edge
containing i is an ancestor of
the edge containing i
i
j
March 2006
Vineet Bafna
Example
r
1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
A
B
Initially, there is a single clade r, and
each node has r as its parent
March 2006
Vineet Bafna
C D
E
Sort columns
•
•
Sort columns according to
the inclusion property (note
that the columns are
already sorted here).
This can be achieved by
considering the columns as
binary representations of
numbers (most significant
bit in row 1) and sorting in
decreasing order
March 2006
Vineet Bafna
A
B
C
D
E
1
1
0
1
0
1
2
1
0
1
0
0
3
0
1
0
1
0
4
0
0
1
0
0
5
0
0
0
1
0
Add first column
•
In adding column i
–
–
Check each edge and
decide which side you
belong.
Finally add a node if
you can resolve a clade
A
B
C
D
E
1 2 3 4 5
1 1 0 0 0
0 0 1 0 0
1 1 0 1 0
0 0 1 0 1
1 0 0 0 0
r
u
A
March 2006
Vineet Bafna
C
E
B
D
Adding other columns
•
Add other
columns on
edges using the
ordering
property
A
B
C
D
E
1
1
0
1
0
1
2
1
0
1
0
0
3
0
1
0
1
0
4
0
0
1
0
0
r
1
E
3
2
B
5
4
D
C
March 2006
Vineet Bafna
A
5
0
0
0
1
0
Unrooted case
•
•
Switch the values in each column, so that 0 is
the majority element.
Apply the algorithm for the rooted case
March 2006
Vineet Bafna
March 2006
Vineet Bafna
March 2006
Vineet Bafna