L13 - Computer Science and Engineering


CSE280b: Population Genetics
Vineet Bafna/Pavel Pevzner
March 2006
www.cse.ucsd.edu/classes/sp05/cse291
Population Genetics
• Individuals in a species (population) are phenotypically different.
• Often these differences are inherited (genetic).
• Studying these differences is important!
• Q: How predictive are these differences?
EX:Population Structure
• 377 locations (loci) were sampled in 1000 people from 52 populations.
• 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003)
• Genetic differences can predict ethnicity.
[Figure: genetic clusters by geographic region: Africa, Eurasia, East Asia, Oceania, America]
Scope of these lectures
• Basic terminology
• Key principles
  – Sources of variation
  – HW equilibrium
  – Linkage
  – Coalescent theory
  – Recombination/Ancestral Recombination Graph
  – Haplotypes/Haplotype phasing
  – Population sub-structure
  – Structural polymorphisms
  – Medical genetics basis: Association mapping/pedigree analysis
Alleles
• Genotype: genetic makeup of an individual
• Allele: A specific variant at a location
  – The notion of alleles predates the concepts of gene and DNA.
  – Initially, alleles referred to variants that described a measurable phenotype (round/wrinkled seed)
  – Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
• Humans are diploid: they have 2 copies of each chromosome.
  – They may be heterozygous or homozygous at a location
  – Other organisms (plants) have higher forms of ploidy.
  – Additionally, some sites might have 2 allelic forms, or even many allelic forms.
What causes variation in a population?
• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (gene conversion)
• Structural Polymorphisms
Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
Short Tandem Repeats
Sequence                     Repeat count
GCTAGATCATCATCATCATTGCTAG    4
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
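The repeat counts on this slide can be reproduced with a short sketch. It assumes the repeat unit is `ATC` (the slide does not name the unit; `TCA` or `CAT` would give the same counts here), and `repeat_count` is a hypothetical helper name.

```python
import re

def repeat_count(sequence: str, unit: str = "ATC") -> int:
    """Number of copies in the longest run of back-to-back copies of `unit`."""
    runs = re.finditer(f"(?:{re.escape(unit)})+", sequence)
    return max((len(m.group(0)) // len(unit) for m in runs), default=0)

# The six sequences shown on the slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
counts = [repeat_count(s) for s in sequences]
print(counts)  # [4, 3, 5, 3, 3, 5]
```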
STR can be used as a DNA fingerprint
• Consider a collection of regions with variable length repeats.
• Variable length repeats will lead to variable length DNA
• Vector of lengths is a fingerprint
[Figure: repeat counts for six individuals (rows) at two loci (columns)]
Individual  Locus 1  Locus 2
1           4        2
2           3        3
3           5        1
4           3        2
5           3        1
6           5        3
Recombination
Parent 1: 00000000
Parent 2: 11111111
Child:    00011111 (crossover after position 3)
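As a minimal sketch of the single crossover shown above, the child copies one parent up to a breakpoint and the other parent afterwards. The function name `crossover` and the breakpoint convention (number of leading sites taken from the first parent) are assumptions made for illustration.

```python
def crossover(parent_a: str, parent_b: str, breakpoint: int) -> str:
    """Child takes the first `breakpoint` sites from parent_a, the rest from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of sites")
    return parent_a[:breakpoint] + parent_b[breakpoint:]

child = crossover("00000000", "11111111", 3)
print(child)  # 00011111
```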
Gene Conversion
• Gene Conversion versus crossover
  – Hard to distinguish in a population
Structural polymorphisms
• Large scale structural changes (deletions/insertions/inversions) may occur in a population.
Topic 1: Basic Principles
• In a ‘stable’ population, the distribution of alleles obeys certain laws
  – Not really, and the deviations are interesting
• HW Equilibrium
  – (due to mixing in a population)
• Linkage (dis)-equilibrium
  – Due to recombination
Hardy Weinberg equilibrium
• Consider a locus with 2 alleles, A, a
• p (respectively, q) is the frequency of A (resp. a) in the population
• 3 Genotypes: AA, Aa, aa
• Q: What is the frequency of each genotype?
If various assumptions are satisfied (such as random mating, no natural selection), then
• $P_{AA} = p^2$
• $P_{Aa} = 2pq$
• $P_{aa} = q^2$
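A small sketch of these formulas, under the stated Hardy-Weinberg assumptions. The function names are hypothetical. It also checks numerically that the frequency of A among the chromosomes of the resulting genotypes is again p.

```python
def hw_genotype_frequencies(p: float) -> dict:
    """Genotype frequencies at a bi-allelic locus with Pr[A] = p, Pr[a] = 1 - p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def allele_frequency_of_A(genotypes: dict) -> float:
    """Each AA individual carries two copies of A, each Aa individual carries one."""
    return genotypes["AA"] + 0.5 * genotypes["Aa"]

freqs = hw_genotype_frequencies(0.3)
p_next = allele_frequency_of_A(freqs)
print(freqs, p_next)
```

With p = 0.3 the genotype frequencies are 0.09, 0.42 and 0.49, and the recovered allele frequency is 0.09 + 0.21 = 0.3.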
Hardy Weinberg: why?
• Assumptions:
  – Diploid
  – Sexual reproduction
  – Random mating
  – Bi-allelic sites
  – Large population size, …
• Why? Each individual randomly picks its two chromosomes from the population. Therefore, Prob.(Aa) = pq + qp = 2pq, and so on.
Hardy Weinberg: Generalizations
• Multiple alleles with frequencies $p_1, p_2, \ldots, p_H$
  – By HW,
    Pr[homozygous genotype i] $= p_i^2$
    Pr[heterozygous genotype i, j] $= 2 p_i p_j$
• Multiple loci?
Hardy Weinberg: Implications
• The allele frequency does not change from generation to generation. Why?
• It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the disease allele?
• Males are 100 times more likely to have the “red” type of color blindness than females. Why?
• Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
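One way to work the two numerical questions above is sketched below, assuming Hardy-Weinberg proportions hold. For phenylketonuria, affected individuals are the recessive homozygotes, so q^2 = 1/10,000. For color blindness the sketch assumes an X-linked recessive allele (an assumption for illustration), so males are affected with probability q and females with probability q^2. The function names are hypothetical.

```python
import math

def carrier_fraction(affected_fraction: float) -> float:
    """Heterozygote frequency 2pq, given that affected individuals have frequency q^2."""
    q = math.sqrt(affected_fraction)
    return 2 * (1 - q) * q

def male_to_female_ratio(q: float) -> float:
    """Assumed X-linked recessive trait: males affected w.p. q, females w.p. q^2."""
    return q / (q * q)

carriers = carrier_fraction(1 / 10_000)  # q = 0.01, so 2pq = 2 * 0.99 * 0.01
ratio = male_to_female_ratio(0.01)       # equals 1/q
print(carriers, ratio)
```

Under these assumptions roughly 2% of the population are carriers, and a 100-fold male excess corresponds to an allele frequency of about 0.01.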
What if there were no recombinations?
• Life would be simpler
• Each individual sequence would have a single parent (even for higher ploidy)
• The relationship is expressed as a tree.
The Infinite Sites Assumption
[Figure: tree rooted at 00000000. A mutation at site 3 gives 00100000; below it, a mutation at site 8 gives 00100001 and a mutation at site 5 gives 00101000.]
• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
• Some phenotypes could be linked to the polymorphisms
• Some of the linkage is “destroyed” by recombination
Infinite sites assumption and Perfect Phylogeny
• Each site is mutated at most once in the history.
• All descendants must carry the mutated value, and all others must carry the ancestral value
[Figure: a mutation at site i on one edge of the tree; leaves below the edge have 1 in position i, all other leaves have 0 in position i]
Perfect Phylogeny
• Assume an evolutionary model in which no recombination takes place, only mutation.
• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
The 4-gamete condition
• A column i partitions the set of species into two sets i0, and i1
• A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.
• EX: i is heterogeneous w.r.t. {A,D,E}
Species  i  Set
A        0  i0
B        0  i0
C        0  i0
D        1  i1
E        1  i1
F        1  i1
4 Gamete Condition
• 4 Gamete Condition
  – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is homogeneous w.r.t. i0 or w.r.t. i1 (that is, j is not heterogeneous w.r.t. both).
  – Equivalent to
  – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the following 4 rows do not all exist: (0,0), (0,1), (1,0), (1,1)
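The pairwise form of the condition translates directly into a check over all column pairs. The sketch below is a straightforward O(n m^2) test, not an optimized algorithm. The name `find_four_gamete_violation` is hypothetical, and the first matrix is the five-species example used later in these slides.

```python
from itertools import combinations

def find_four_gamete_violation(matrix):
    """Return the first column pair (i, j) showing all four gametes, or None."""
    n_cols = len(matrix[0])
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:  # (0,0), (0,1), (1,0), (1,1) all present
            return (i, j)
    return None

example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(find_four_gamete_violation(example))      # None
print(find_four_gamete_violation(conflicting))  # (0, 1)
```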
4-gamete condition: proof (only if)
• (only if) Every perfect phylogeny satisfies the 4-gamete condition
• Depending on which edge the mutation j occurs, either i0 or i1 should be homogeneous.
• (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
[Figure: tree in which the edge carrying mutation i separates i0 from i1, with mutation j on an edge inside one of the two sides]
Handling recombination
• A tree is not sufficient as a sequence may have 2 parents
• Recombination leads to loss of correlation between columns
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 1: No recombination
• Each new individual chromosome chooses a parent from the existing ‘haplotype’
A  B
0  1
0  1
0  0
0  0
1  0
1  0
1  0
1  0
New chromosome: 1 0
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 2: diploidy and recombination
• Each new individual chooses a parent from the existing alleles
A  B
0  1
0  1
0  0
0  0
1  0
1  0
1  0
1  0
New chromosome: 1 1
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 1: No recombination
  Each new individual chooses a parent from the existing ‘haplotype’
  – Pr[A,B = 0,1] = 0.25
  • Linkage disequilibrium
• Case 2: Extensive recombination
  Each new individual simply chooses an allele independently at each site
  – Pr[A,B = 0,1] = 0.125
  • Linkage equilibrium
A  B
0  1
0  1
0  0
0  0
1  0
1  0
1  0
1  0
LD
• In the absence of recombination,
  – Correlation between columns
  – The joint probability Pr[A=a,B=b] is different from P(a)P(b)
• With extensive recombination
  – Pr(a,b) = P(a)P(b)
Measures of LD
• Consider two bi-allelic sites with alleles marked with 0 and 1
• Define
  – $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]
  – $P_{0*}$ = Pr[Allele 0 in locus 1]
• Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$
• $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}| = \ldots$
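To illustrate that the measure does not depend on which pair of alleles is used, the sketch below computes the signed difference for all four allele pairs. The eight haplotypes are read from the A/B table on the earlier LD slides (that reading of the table is an assumption), and `ld_coefficient` is a hypothetical name.

```python
from fractions import Fraction

# (A, B) haplotypes, read row by row from the earlier LD table.
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

def ld_coefficient(haps, a, b):
    """Signed difference Pr[A=a, B=b] - Pr[A=a] * Pr[B=b]."""
    n = len(haps)
    p_ab = Fraction(sum(1 for h in haps if h == (a, b)), n)
    p_a = Fraction(sum(1 for h in haps if h[0] == a), n)
    p_b = Fraction(sum(1 for h in haps if h[1] == b), n)
    return p_ab - p_a * p_b

d_values = {(a, b): ld_coefficient(haplotypes, a, b) for a in (0, 1) for b in (0, 1)}
for pair, d in d_values.items():
    print(pair, d)
```

For this table every pair gives |D| = 1/8, matching the gap between 0.25 and 0.125 on the previous slide.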
LD over time
• With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear
  – Let $D^{(t)}$ = LD at time t
  – $P^{(t)}_{00} = (1-r)\,P^{(t-1)}_{00} + r\,P^{(t-1)}_{0*} P^{(t-1)}_{*0}$
  – $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ (HW)
  – $D^{(t)} = (1-r)\,D^{(t-1)} = (1-r)^t D^{(0)}$
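The recursion can be checked numerically against the closed form. This is a sketch under the slide's assumptions (allele frequencies at each site stay constant, fixed r). The starting frequencies and r = 0.1 are arbitrary illustrative values, and `simulate_ld_decay` is a hypothetical name.

```python
def simulate_ld_decay(p00, p0_star, p_star0, r, generations):
    """Iterate P00(t) = (1-r) P00(t-1) + r P0* P*0 and record D(t) each generation."""
    d_values = [p00 - p0_star * p_star0]
    for _ in range(generations):
        p00 = (1 - r) * p00 + r * p0_star * p_star0
        d_values.append(p00 - p0_star * p_star0)
    return d_values

r = 0.1
d_iterated = simulate_ld_decay(p00=0.25, p0_star=0.5, p_star0=0.75, r=r, generations=20)
d_closed_form = [(1 - r) ** t * d_iterated[0] for t in range(21)]
print(d_iterated[0], d_iterated[20], d_closed_form[20])
```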
LD over distance
• Assumption
  – Recombination rate increases linearly with distance
  – LD decays exponentially with distance.
• The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
• This simple fact is the basis of disease association mapping.
LD and disease mapping
• Consider a mutation that is causal for a disease.
• The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
• Consider every polymorphism, and check:
  – There might be too many polymorphisms
  – Multiple mutations (even at a single locus) that lead to the same disease
• Instead, consider a dense sample of polymorphisms that span the genome
LD can be used to map disease genes
[Figure: LD plotted along the chromosome. A marker column (0, 1, 1, 0, 0, 1) is shown next to disease status (D, N, N, D, D, N).]
• LD decays with distance from the disease allele.
• By plotting LD, one can short list the region containing the disease gene.
LD and disease gene mapping problems
• Marker density?
• Complex diseases
• Population sub-structure
Population Genetics
• Often we look at these equilibria (Linkage/HW) and their deviations in specific populations
• These deviations offer insight into evolution.
• However, what is Normal?
• A combination of empirical (simulation) and theoretical insight helps distinguish between expected and unexpected.
Topic 2: Simulating population data
• We described various population genetic concepts (HW, LD), and their applicability
• The values of these parameters depend critically upon the population assumptions.
  – What if we do not have infinite populations?
  – No random mating (Ex: geographic isolation)
  – Sudden growth
  – Bottlenecks
  – Ad-mixture
• It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?
Wright Fisher Model of Evolution
• Fixed population size from generation to generation
• Random mating
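A minimal forward-in-time sketch of this model, assuming a haploid population in which every chromosome of the next generation copies a uniformly random parent from the current one. The function name, the seed and the starting population are illustrative choices, not taken from the slides.

```python
import random

def wright_fisher(population, generations, rng):
    """Return the list of generations; the population size stays fixed throughout."""
    history = [list(population)]
    for _ in range(generations):
        # Random mating: each child picks its parent uniformly, with replacement.
        population = [rng.choice(population) for _ in range(len(population))]
        history.append(population)
    return history

rng = random.Random(0)
start = ["A"] * 5 + ["a"] * 5
history = wright_fisher(start, generations=50, rng=rng)
for t in (0, 10, 50):
    print(t, history[t].count("A") / len(history[t]))
```

Because no mutation is modeled, an allele that is lost in some generation never reappears; allele frequencies drift until one allele fixes.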
Coalescent model
• Insight 1:
  – Separate the genealogy from allelic states (mutations)
  – First generate the genealogy (who begat whom)
  – Assign an allelic state (0) to the ancestor. Drop mutations on the branches.
Coalescent theory
• Insight 2:
  – Much of the genealogy is irrelevant, because it disappears.
  – Better to go backwards
Coalescent theory (Kingman)
• Input
  – (Fixed population (N individuals), random mating)
• Consider 2 individuals.
  – Probability that they coalesce in the previous generation (have the same parent) $= \frac{1}{N}$
• Probability that they do not coalesce after t generations $= \left(1 - \frac{1}{N}\right)^t \approx e^{-t/N}$
Coalescent theory
• Consider k individuals.
  – Probability that no pair coalesces after 1 generation $\approx 1 - \binom{k}{2}\frac{1}{N}$
  – Probability that no pair coalesces after t generations $\approx \left(1 - \binom{k}{2}\frac{1}{N}\right)^t \approx e^{-\binom{k}{2} t/N} = e^{-\binom{k}{2}\tau}$
  where $\tau = t/N$ is time in units of N generations
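The quality of these approximations can be checked numerically. The sketch assumes a haploid population of N individuals in which each of the k lineages picks its parent uniformly at random, so that the exact one-generation probability of no coalescence is the product of (1 - i/N) for i = 1, ..., k-1. Function names and the values k = 5, N = 1000 are illustrative.

```python
import math

def prob_no_coalescence_exact(k: int, N: int) -> float:
    """All k lineages pick distinct parents among N in one generation."""
    prob = 1.0
    for i in range(1, k):
        prob *= 1 - i / N
    return prob

def prob_no_coalescence_linear(k: int, N: int) -> float:
    """First-order approximation 1 - C(k,2)/N."""
    return 1 - math.comb(k, 2) / N

def prob_no_coalescence_exponential(k: int, N: int, t: int = 1) -> float:
    """Approximation exp(-C(k,2) t / N) for t generations."""
    return math.exp(-math.comb(k, 2) * t / N)

k, N = 5, 1000
exact = prob_no_coalescence_exact(k, N)
linear = prob_no_coalescence_linear(k, N)
exponential = prob_no_coalescence_exponential(k, N)
print(exact, linear, exponential)
```

The three values agree to about four decimal places when k is much smaller than N, which is the regime in which the coalescent approximation is used.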
Coalescent approximation
• Insight 3:
  – Topology is independent of coalescent times
  – If you have n individuals, generate a random binary topology
    • Iterate (until one individual)
      – Pick a pair at random, and coalesce
• Insight 4:
  – To generate coalescent times, there is no need to go back generation by generation
Coalescent approximation
• At any step, there are 1 <= k <= n individuals
• To generate time to coalesce (k to k-1 individuals)
  – Pick a number from exponential distribution with rate k(k-1)/2
  – Mean time to coalescence = 2/(k(k-1))
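Insights 3 and 4 combine into a short simulation: draw the waiting time from an exponential distribution with rate k(k-1)/2, then merge a uniformly random pair. The sketch below represents each lineage by the tuple of sample indices below it. The function name and the event format are assumptions for illustration, and time is in the scaled units used on the slide.

```python
import random

def simulate_coalescent(n: int, rng: random.Random):
    """Return a list of (time, merged_lineage) events, one per coalescence."""
    lineages = [(i,) for i in range(n)]
    time = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        time += rng.expovariate(k * (k - 1) / 2)   # time for k -> k-1 lineages
        a, b = rng.sample(range(k), 2)             # pick a pair at random
        merged = tuple(sorted(lineages[a] + lineages[b]))
        lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
        events.append((time, merged))
    return events

events = simulate_coalescent(6, random.Random(1))
for time, merged in events:
    print(round(time, 3), merged)
```

The population size N does not appear in the code, because the scaling of time has absorbed it.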
Typical coalescents
• 4 random examples with n=6 (Note that we do not need to specify N. Why?)
• Expected time to coalesce?
Coalescent properties
• Expected time for the last step (from 2 individuals to 1) = 1
• The last step is about half of the total time to coalesce
• Studying larger number of individuals does not change numbers tremendously
• EX: Number of mutations in a population is proportional to the total branch length of the tree
  – E(Ttot)
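These properties follow from summing the mean step times 2/(k(k-1)) given earlier. The sketch below computes, in exact arithmetic, the expected time to the most recent common ancestor and the expected total branch length (each step with k lineages contributes k times its duration). The function names are hypothetical, and times are in the same scaled units as the slide.

```python
from fractions import Fraction

def expected_step_time(k: int) -> Fraction:
    """Mean of an exponential with rate k(k-1)/2."""
    return Fraction(2, k * (k - 1))

def expected_tmrca(n: int) -> Fraction:
    """Sum of the mean step times for k = n, ..., 2."""
    return sum(expected_step_time(k) for k in range(2, n + 1))

def expected_total_branch_length(n: int) -> Fraction:
    """While k lineages exist, k branches grow simultaneously."""
    return sum(k * expected_step_time(k) for k in range(2, n + 1))

for n in (2, 6, 100):
    print(n, float(expected_tmrca(n)), float(expected_total_branch_length(n)))
```

The first sum telescopes to 2(1 - 1/n), which stays below 2 for every n, so the last step (mean 1) accounts for at least half of the expected total time. The total branch length grows only like 2 log n.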
Variants (exponentially growing populations)
• If the population is growing exponentially, the branch lengths become similar, or even star-like. Why?
• With appropriate scaling of time, the same process can be extended to various scenarios: male-female, hermaphrodite, segregation, migration, etc.
Simulating population data
• Generate a coalescent (Topology + Branch lengths)
• For each branch length, drop mutations with rate $\theta$
• Generate sequence data
• Note that the resulting sequence data admit a perfect phylogeny.
• Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths)
• Also, note that all pairs of positions are correlated (should have high LD).
Coalescent with Recombination
• An individual may have one parent, or 2 parents
ARG: Coalescent with recombination
• Given: mutation rate $\theta$, recombination rate $\rho$, population size 2N (diploid), sample size n.
• How can you generate the ARG (topology+branch lengths) efficiently?
• How will you generate sequences for n individuals?
• Given sequence data, can you reconstruct the ARG (topology)?
Recombination
• Define r as the probability of recombining.
  – Note that the parameter is a scaled value which will be defined later
• Assume k individuals in a generation. The following might happen:
  1. An individual arises because of a recombination event between two individuals (it will have 2 parents).
  2. Two individuals coalesce
  3. Neither (each individual has a distinct parent)
  4. Multiple events (low probability)
Recombination
• We ignore the case of multiple (> 1) events in one generation
• Pr(No recombination) $= 1 - kr$
• Pr(No coalescence) $= 1 - \binom{k}{2}\frac{1}{2N}$
• Consider scaled time in units of 2N generations. Thus the number of individuals increases with rate $kr \cdot 2N$, and decreases with rate $\binom{k}{2}$
• The value 2rN is usually small, and therefore the process will ultimately coalesce to a single individual (MRCA)
ARG
• Let k = n,
• Define $\rho = 4rN$
• Iterate until k = 1
  – Choose time from an exponential distribution with rate $\binom{k}{2} + \frac{k\rho}{2}$
  – Pick event as recombination with probability $\frac{k\rho/2}{\binom{k}{2} + k\rho/2} = \frac{\rho}{\rho + (k-1)}$
  – If event is recombination, choose an individual to recombine, and a position, else choose a pair to coalesce.
  – Update k, and continue
• What is the flaw in this procedure?
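The event loop of this procedure can be sketched as follows. The sketch tracks only the number of lineages k and the sequence of event types. It does not choose individuals or positions, and it follows the procedure as stated without addressing the flaw the slide asks about. The function name and return format are assumptions for illustration.

```python
import random

def simulate_arg_events(n: int, rho: float, rng: random.Random):
    """Return a list of (time, event_type) until a single lineage remains."""
    k = n
    time = 0.0
    events = []
    while k > 1:
        coalescence_rate = k * (k - 1) / 2
        recombination_rate = k * rho / 2
        total_rate = coalescence_rate + recombination_rate
        time += rng.expovariate(total_rate)
        # Recombination with probability rho / (rho + k - 1).
        if rng.random() < recombination_rate / total_rate:
            events.append((time, "recombination"))
            k += 1   # one lineage splits into two parents
        else:
            events.append((time, "coalescence"))
            k -= 1   # two lineages merge
    return events

events = simulate_arg_events(n=5, rho=1.0, rng=random.Random(2))
n_rec = sum(1 for _, e in events if e == "recombination")
n_coal = sum(1 for _, e in events if e == "coalescence")
print(n_rec, n_coal)
```

Every recombination adds a lineage that must later be removed by a coalescence, so the number of coalescences always exceeds the number of recombinations by exactly n - 1.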
Simulating sequences on the ARG
• Generate topology and branch lengths as before
• For each recombination, generate a position.
• Next generate mutations at random on branch lengths
  – For a mutation, select a position as well.
Recombination events and $\rho$
• Given $\rho$, n, can you compute the expected number of recombination events?
• It can be shown that $E(n, \rho) = \rho \log(n)$
• The question that people are really interested in
  • Given a set of sequences from a population, compute the recombination rate $\rho$
  • Given a population reconstruct the most likely history (as an ancestral recombination graph)
  • We will address this question in subsequent lectures
An algorithm for constructing a perfect phylogeny
• We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction will be removed later.
• In any tree, each node (except the root) has a single parent.
  – It is sufficient to construct a parent for every node.
• In each step, we add a column and refine some of the nodes containing multiple children.
• Stop when all columns have been considered.
Inclusion Property
• For any pair of columns i,j
  – i < j if and only if $i_1 \supseteq j_1$
• Note that if i<j then the edge containing i is an ancestor of the edge containing j
Example
    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
Initially, there is a single clade r, and each node has r as its parent
[Figure: root r with children A, B, C, D, E]
Sort columns
• Sort columns according to the inclusion property (note that the columns are already sorted here).
• This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order
    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
Add first column
• In adding column i
  – Check each edge and decide which side you belong.
  – Finally add a node if you can resolve a clade
    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
[Figure: after adding column 1, root r has a new child u with children A, C, E; B and D remain children of r]
Adding other columns
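• Add other columns on edges using the ordering property
    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
[Figure: final tree. Edge 1 leaves r and ends at a node with leaf E; edge 2 below it ends at a node with leaf A; edge 4 below that leads to C. Edge 3 leaves r and ends at a node with leaf B; edge 5 below it leads to D.]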
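The column-by-column construction can be sketched as below. This is one possible O(n m^2) realization under the rooted assumption (0 ancestral, 1 mutated), not necessarily the exact procedure drawn on the slides. Columns are sorted as binary numbers in decreasing order; each column's edge is hung below the nearest earlier column whose 1-set contains it; each species is attached below the last sorted column in which it has a 1. The function name and the two returned dictionaries are hypothetical, and columns are 0-indexed.

```python
def build_perfect_phylogeny(matrix):
    """Return (column_parent, leaf_parent); None stands for the root r.

    Raises ValueError if two columns overlap without containment.
    Assumes no all-zero columns.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    one_sets = [frozenset(r for r in range(n_rows) if matrix[r][c] == 1)
                for c in range(n_cols)]

    def column_value(c):
        # Column read as a binary number, most significant bit in row 1.
        return int("".join(str(matrix[r][c]) for r in range(n_rows)), 2)

    order = sorted(range(n_cols), key=column_value, reverse=True)

    column_parent = {}
    for pos, c in enumerate(order):
        column_parent[c] = None
        for prev in reversed(order[:pos]):
            if one_sets[prev] >= one_sets[c]:      # inclusion: prev is an ancestor
                column_parent[c] = prev
                break
            if one_sets[prev] & one_sets[c]:       # overlap without inclusion
                raise ValueError(f"columns {prev} and {c} conflict")

    rank = {c: pos for pos, c in enumerate(order)}
    leaf_parent = {}
    for r in range(n_rows):
        cols = [c for c in range(n_cols) if matrix[r][c] == 1]
        leaf_parent[r] = max(cols, key=rank.get) if cols else None
    return column_parent, leaf_parent

example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
column_parent, leaf_parent = build_perfect_phylogeny(example)
print(column_parent)  # {0: None, 1: 0, 2: None, 3: 1, 4: 2}
print(leaf_parent)    # {0: 1, 1: 2, 2: 3, 3: 4, 4: 0}
```

In the slides' 1-based numbering this reads: edges 1 and 3 leave the root, edge 2 lies below 1, edge 4 below 2, edge 5 below 3; A hangs below edge 2, B below 3, C below 4, D below 5 and E below 1.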
Unrooted case
• Switch the values in each column, so that 0 is the majority element.
• Apply the algorithm for the rooted case
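The first step of the unrooted case is a simple preprocessing pass, sketched below. Columns in which 1 is the strict majority are complemented; ties are left unchanged, which is a choice made here and not specified on the slide. The function name and the small input matrix are illustrative.

```python
def make_zero_majority(matrix):
    """Return a copy of the matrix with every majority-1 column complemented."""
    n_rows = len(matrix)
    result = [list(row) for row in matrix]
    for c in range(len(matrix[0])):
        ones = sum(row[c] for row in matrix)
        if 2 * ones > n_rows:          # 1 is the strict majority in this column
            for row in result:
                row[c] = 1 - row[c]
    return result

unrooted = [[1, 0], [1, 0], [1, 1], [0, 0]]
normalized = make_zero_majority(unrooted)
print(normalized)  # [[0, 0], [0, 0], [0, 1], [1, 0]]
```

The normalized matrix can then be passed to any rooted perfect phylogeny routine.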