SCB Workshop 14.12.2006
Isaac Newton Institute, Cambridge
Estimating Genealogies
from Marker Data
Dario Gasbarra
Matti Pirinen
Mikko Sillanpää
Elja Arjas
Biometry Group
Department of Mathematics and Statistics
University of Helsinki
Outline of the presentation
The Problem
Description of the method
Probability model
Computational aspects
Example 1
unlinked markers
relatedness estimation (with a pedigree)
Gasbarra et al. (2006): Estimating Genealogies from Unlinked Marker Data: a Bayesian Approach (under revision)
Example 2
linked markers
haplotyping
relatedness estimation (with IBD-alleles)
Gasbarra et al. (2006): Estimating Genealogies from Linked Marker Data: a Bayesian Approach (under preparation)
A Basic question in statistical genetics
Consider a population evolving in time
Inverse problem
Current state of the process is known
- individuals alive at the moment
What was the path leading to this state?
- family structures (pedigree)
- inheritance patterns
Why is the recent past important?
Relatedness estimation
In which parts of the genome does a group of individuals share alleles (identical-by-descent)?
gene mapping
Haplotyping
Ancestral meioses have formed the haplotypes of the contemporary individuals
Current methods on KNOWN pedigrees
Exact calculations on known pedigrees
Elston-Stewart algorithm
- A few markers, not too complex pedigrees
Lander-Green algorithm
- Small pedigrees, many markers
Approximate calculations on known pedigrees
McMC methods (e.g. Simwalk2 [Sobel et al.], Loki [Heath])
What if the pedigree is not known?
There may be only partial pedigree data available.
Small pedigrees might share common ancestors already
within a couple of generations backwards in time
What we do …
Consider a sample of individuals from a population
Genotype data on (possibly linked) markers
Model the pedigree and the gene flow explicitly, applying a construction which proceeds backwards in time
Recombinations modelled based on genetic distance
Non-random mating allowed
Devise an McMC sampler with good mixing properties
Extends, for computational reasons, only tens of generations backwards in time
… and what we hope to get
Obtain useful summary statistics
E.g. estimates of IBD-probabilities between pairs of sampled individuals
Use the algorithm to perform numerical integration over model unobservables
E.g. in gene mapping, when combined with a phenotype model, to account for shared ancestry
The frame of study
Assume that we have fixed
A population whose size we know for T-1 (non-overlapping) generations backwards in time (T~10)
N sampled individuals from the current generation
Marker map with M markers and known recombination fractions
Allele frequencies at the population level for each of the markers
A (prior) model for a possible history
A configuration C consists of
a pedigree
allelic paths
Specify probabilities for
Pedigree graph, Pg(C)
Recombination events, Pr(C)
Founder alleles, Pa(C)
The total probability for C is
P(C) = Pg(C) x Pr(C) x Pa(C)
A probability model for pedigrees
For fixed
number of generations, T-1, backwards in time
population size in each generation (number of ♂ and ♀)
sample of size N from the current generation
mating parameters α and β
To simulate a pedigree from the distribution we
Proceed generation by generation from 0,…,T-1.
Let children choose parents according to a Pólya urn scheme, where α affects the correlation of choices of fathers and β affects the correlation of choices of mothers given the choices of fathers.
Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.
Children choosing fathers
Suppose k children have chosen their fathers from among N_m males of the population
Ch(m) is the number of children that have chosen male m
P(k+1 chooses father m) ~ α + Ch(m)
Small α implies dominant males
Large α implies that the number of offspring does not vary much between different males
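The urn rule for fathers can be sketched in a few lines of Python. This is a minimal illustration, assuming that "~" on the slide means "proportional to" and that α > 0; the function name `choose_fathers` and the population sizes in the usage lines are hypothetical and are not taken from the authors' software.

```python
import random
from collections import Counter

def choose_fathers(n_children, n_males, alpha, rng):
    """Children pick fathers one at a time.

    The weight of male m for the next child is alpha + Ch(m), where Ch(m)
    is the number of earlier children that already chose m (alpha > 0).
    """
    counts = [0] * n_males          # Ch(m) for every male
    fathers = []
    for _ in range(n_children):
        weights = [alpha + c for c in counts]
        m = rng.choices(range(n_males), weights=weights, k=1)[0]
        counts[m] += 1
        fathers.append(m)
    return fathers

rng = random.Random(1)
small = choose_fathers(200, 20, 0.05, rng)   # small alpha: few dominant males
large = choose_fathers(200, 20, 50.0, rng)   # large alpha: nearly even offspring counts
print("largest family, small alpha:", max(Counter(small).values()))
print("largest family, large alpha:", max(Counter(large).values()))
```

With a small α the first few chosen males keep attracting children, so a handful of fathers account for most offspring; with a large α the counts Ch(m) barely change the weights and the choice is close to uniform.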
Children choosing mothers
Suppose k children have chosen their mothers from among N_f females of the population
Ch(m,f) is the number of children who have chosen male m and female f as their parents
P(k+1 chooses mother f | the father of k+1 is m) ~ Ch(m,f)+β
Small β implies faithful males (monogamy in large populations)
Large β implies random mating
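The conditional choice of mothers admits the same kind of sketch. The code below assumes the fathers have already been assigned and again reads "~" as "proportional to", with β > 0; `choose_mothers` and the example inputs are hypothetical names and values used only for illustration.

```python
import random

def choose_mothers(fathers, n_females, beta, rng):
    """Pick a mother for each child, given that child's father.

    For a child with father m, the weight of female f is Ch(m, f) + beta,
    where Ch(m, f) counts earlier children with parents (m, f).
    """
    pair_counts = {}                # (m, f) -> Ch(m, f)
    mothers = []
    for m in fathers:
        weights = [pair_counts.get((m, f), 0) + beta for f in range(n_females)]
        f = rng.choices(range(n_females), weights=weights, k=1)[0]
        pair_counts[(m, f)] = pair_counts.get((m, f), 0) + 1
        mothers.append(f)
    return mothers

rng = random.Random(2)
fathers = [0] * 30 + [1] * 30                       # two males with 30 children each
faithful = choose_mothers(fathers, 20, 1e-6, rng)   # small beta
mixed = choose_mothers(fathers, 20, 100.0, rng)     # large beta
print("mates of male 0, small beta:", len(set(faithful[:30])))
print("mates of male 0, large beta:", len(set(mixed[:30])))
```

With a small β a male's first partner dominates all his later weights, so he stays with one female. Nothing in the rule stops two males from picking the same first partner, which is why monogamy is only expected when the population is large.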
Examples with different parameters
Left: a few dominant males + monogamy
Middle: a few dominant males
Right: Random mating
Probability for allelic paths
For each non-founder haplotype in the pedigree form the expression
Take the product of these over all haplotypes to obtain Pr(C)
Consider all founder alleles and take the product of the corresponding population allele frequencies to get Pa(C) (founders are assumed to be in Hardy-Weinberg (H-W) and linkage equilibrium)
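The per-haplotype expression is not preserved in this transcript. As a hedged illustration only, the sketch below uses a common textbook form: recombinations in different marker intervals are treated as independent (no interference), and the strand transmitted at the first marker is taken to be either parental strand with probability 1/2. Both choices, and the names `recombination_prob` and `founder_allele_prob`, are assumptions and may differ from the authors' expression.

```python
import math

def recombination_prob(origins, thetas):
    """Probability of one transmitted haplotype's origin pattern.

    origins[i] is 0 or 1: which parental strand the allele at marker i
    was copied from. thetas[i] is the recombination fraction between
    markers i and i+1 (assumed known from the marker map).
    """
    p = 0.5                                   # assumed: first strand picked at random
    for a, b, theta in zip(origins, origins[1:], thetas):
        p *= theta if a != b else 1.0 - theta
    return p

def founder_allele_prob(founder_alleles, freqs):
    """Product of population frequencies over all founder alleles.

    founder_alleles is a list of (marker_index, allele) pairs and
    freqs[marker_index][allele] is the population allele frequency.
    """
    return math.prod(freqs[i][a] for i, a in founder_alleles)

# One haplotype over 4 markers with a single strand switch, theta = 0.05
print(recombination_prob([0, 0, 1, 1], [0.05, 0.05, 0.05]))

# Three founder alleles at two markers (frequencies are made up)
freqs = [{"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}]
print(founder_allele_prob([(0, "a"), (0, "b"), (1, "a")], freqs))
```

Under these assumptions Pr(C) would be the product of `recombination_prob` over all non-founder haplotypes, and the plain product in `founder_allele_prob` is what the Hardy-Weinberg and linkage equilibrium assumption on founders allows.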
Data
Assume that we also have
Genotype data of the sampled individuals on M markers
The (posterior) probability in our model is
π(C) ~ Pg(C) x Pr(C) x Pa(C) x I(C consistent with the data)
We are able to sample efficiently from the prior but not from the posterior
Markov chain Monte Carlo sampling
We generate a Markov chain whose state space consists of
all configurations consistent with the data and whose
stationary distribution is our posterior (Metropolis-Hastings
algorithm)
Highly dependent variables (close relatives and linked
markers) require large block updates
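The structure of such a sampler can be shown on a toy problem. In the sketch below the "configurations" are just the integers 0..5, the prior weights are made up, and "consistent with the data" is replaced by "is even"; none of this comes from the talk. It is a plain Metropolis-Hastings chain with a symmetric proposal, not the block updates used for real pedigrees.

```python
import random

# Hypothetical toy state space: configurations are the integers 0..5.
PRIOR_WEIGHT = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # unnormalised prior, made up

def consistent(state):
    """Stand-in for I(C consistent with the data): even states only."""
    return state % 2 == 0

def target(state):
    """Unnormalised posterior: prior weight times the consistency indicator."""
    return PRIOR_WEIGHT[state] if consistent(state) else 0.0

def metropolis_hastings(n_iter, start, rng):
    if target(start) <= 0.0:
        raise ValueError("the chain must start from a consistent state")
    state, chain = start, []
    for _ in range(n_iter):
        proposal = rng.randrange(len(PRIOR_WEIGHT))   # symmetric proposal
        # Accept with probability min(1, target(proposal) / target(state));
        # inconsistent proposals have target 0 and are never accepted.
        if rng.random() < target(proposal) / target(state):
            state = proposal
        chain.append(state)
    return chain

chain = metropolis_hastings(50_000, start=0, rng=random.Random(0))
freq = {s: chain.count(s) / len(chain) for s in sorted(set(chain))}
print(freq)   # should be close to 1/9, 3/9 and 5/9 for states 0, 2 and 4
```

Because inconsistent proposals are always rejected, the chain never leaves the set of consistent states. This is also why single-site proposals mix poorly when variables are strongly dependent: most small changes break consistency and get rejected.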
Proposals
Different versions of proposals
A (randomly chosen) group of children chooses (possibly new) parents and transmits their alleles to these parents
All children of a fixed father/mother choose a (possibly new) mother/father and transmit their alleles to her/him
One child at a time chooses parent(s) and transmits alleles
All children within the group jointly choose new parents and transmit alleles
Pedigree is not changed but new allele paths are proposed
Schematic representation of some updates in
the MCMC algorithm
Example 1:
Relatedness estimation with unlinked markers
Simulated data
20 generations ago a single founder population divided into 3 population isolates
Our sample contains 10 sibships of 3 individuals from each of the 3 populations (i.e. 90 individuals altogether)
Relatedness matrix estimated from pedigrees
Qualitative reconstruction with dendrogram
Same data analyzed by STRUCTURE (panels: 3 pop, 10 pop, 30 pop)
Real data example: individuals sampled from
Eastern and Western Finland: 31 unlinked
microsatellite markers
Example 2:
The case of linked markers
Simulated pedigree
10 generations
Youngest generation
39 individuals divided into 13 nuclear families
Genotype data
20 markers / 10 alleles
Recombination fraction 0.05
Reconstruction
We gave the algorithm
The genotype data on the youngest generation
The (correct) marker map
The (correct) allele frequencies
The population structure
The algorithm was run for 500,000 iterations
Reconstructing the pedigree
Reconstructing the haplotypes
The accuracy of the haplotype reconstruction can be measured with the concept of switch distance (SD)
SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes into the other
If correct haplotypes were (111111,222222) then
(111222,222111) has SD=1
(112211,221122) has SD=2
(121212,212121) has SD=5
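The three values above can be checked with a short function. This is a sketch under one assumption not stated on the slide: homozygous loci carry no phase information and are skipped, so "neighboring loci" means neighboring heterozygous loci. The name `switch_distance` is chosen here for illustration.

```python
def switch_distance(true_pair, est_pair):
    """Number of phase switches between two haplotype pairs of one genotype.

    Each pair is two equal-length sequences of alleles. At every
    heterozygous locus we record whether the estimated pair has the same
    orientation as the true pair (0) or the flipped one (1); the switch
    distance is the number of orientation changes between consecutive
    heterozygous loci.
    """
    t1, t2 = true_pair
    e1, e2 = est_pair
    phases = []
    for a, b, c, d in zip(t1, t2, e1, e2):
        if a == b:
            continue                      # homozygous: phase undefined
        if (c, d) == (a, b):
            phases.append(0)
        elif (c, d) == (b, a):
            phases.append(1)
        else:
            raise ValueError("haplotype pairs imply different genotypes")
    return sum(p != q for p, q in zip(phases, phases[1:]))

truth = ("111111", "222222")
for est in [("111222", "222111"), ("112211", "221122"), ("121212", "212121")]:
    print(est, switch_distance(truth, est))
```

Counting orientation changes rather than mismatching alleles makes the measure independent of which haplotype of a pair is listed first: swapping the two estimated haplotypes flips every orientation and leaves the number of changes the same.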
Reconstructing the haplotypes
The SDs between the reconstructed and the true haplotype pairs of the
youngest generation (sum over all 39 individuals)
Reconstructing the IBD sharing
We consider those alleles IBD (identical by descent) that
trace back to a common ancestral allele at the founder level
(9 generations backwards in time)
It is possible to calculate a single quantity that measures
the proportion of the genome that two individuals share
(coefficient of relatedness r)
It is also possible to compare the IBD sharing more
accurately along the chromosome
Comparison with IBS-based estimators
Distribution of L_2 errors (741 values)
Lynch (1988)
Lynch and Ritland (1999)
Wang (2002)
Sums: 1.93, 3.25, 3.27, 3.51
Reconstructing IBD
Future work
Possibility of fixing some parts of the pedigree
Extending partially known genotype data to the known pedigree
Pirinen, Gasbarra (2006): Finding consistent gene transmission patterns on large and complex pedigrees. IEEE Trans. Comp. Biol. Bioinf. 3:252-262
Future work
Adding a QTL or phenotype model to the algorithm
Allowing for mutations and considering evolutionary time scales (Ancestral Recombination Graph)
Running many chains in parallel "at different temperatures"
McMcMC with 20 processors achieved a slightly better accuracy in 12 hours (of wall-clock time) than a single processor in 5 days
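The slides do not say how states are exchanged between chains. For reference, a minimal sketch of the standard swap step in Metropolis-coupled MCMC is given below. It assumes that the chain at temperature T targets the posterior raised to the power 1/T; the function name `swap_acceptance` and the numbers in the usage lines are hypothetical.

```python
import math

def swap_acceptance(logp_i, logp_j, temp_i, temp_j):
    """Acceptance probability for swapping the states of chains i and j.

    logp_i and logp_j are the log posterior values of the current states
    of chains i and j; chain k is assumed to target posterior ** (1 / temp_k).
    """
    log_ratio = (1.0 / temp_i - 1.0 / temp_j) * (logp_j - logp_i)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

# The cold chain (T=1) holds a worse state than the hot chain (T=2): always swap
print(swap_acceptance(-10.0, -5.0, 1.0, 2.0))
# The cold chain already holds the better state: the swap is accepted less often
print(swap_acceptance(-5.0, -10.0, 1.0, 2.0))
```

Hot chains move more freely across the state space, and accepted swaps pass their states down to the cold chain, whose samples are the ones used for inference. Each chain can run on its own processor between swap attempts.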
Thanks
Matti Pirinen
Dario Gasbarra
Mikko Sillanpää