SCB Workshop 14.12.2006
Isaac Newton Institute, Cambridge
Estimating Genealogies from Marker Data
Dario Gasbarra
Matti Pirinen
Mikko Sillanpää
Elja Arjas
Biometry Group
Department of Mathematics and Statistics
University of Helsinki
Outline of the presentation
 The Problem
 Description of the method
 Probability model
 Computational aspects
 Example 1
 unlinked markers
 relatedness estimation (with a pedigree)
 Gasbarra et al. (2006): Estimating Genealogies from Unlinked Marker Data: a Bayesian Approach (under revision)
 Example 2
 linked markers
 haplotyping
 relatedness estimation (with IBD-alleles)
 Gasbarra et al. (2006): Estimating Genealogies from Linked Marker Data: a Bayesian Approach (under preparation)
A Basic question in statistical genetics
 Consider a population evolving in time
 Inverse problem
 Current state of the process is known
- individuals alive at the moment
 What was the path leading to this state?
- family structures (pedigree)
- inheritance patterns
Why is the recent past important?
 Relatedness estimation
 In which parts of the genome does a group of individuals share alleles (identical-by-descent)?
 gene mapping
 Haplotyping
 Ancestral meioses have formed the haplotypes of the contemporary individuals
Current methods on KNOWN pedigrees
 Exact calculations on known pedigrees
 Elston-Stewart algorithm
- A few markers, not too complex pedigrees
 Lander-Green algorithm
- Small pedigrees, many markers
 Approximate calculations on known pedigrees
 McMC methods (e.g. Simwalk2 [Sobel et al.], Loki [Heath])
What if the pedigree is not known?
 There may be only partial pedigree data available.
 Small pedigrees might share common ancestors already
within a couple of generations backwards in time
What we do …
 Consider a sample of individuals from a population
 Genotype data on (possibly linked) markers
 Model the pedigree and the gene flow explicitly, applying a construction which proceeds backwards in time
 Recombinations modelled based on genetic distance
 Non-random mating allowed
 Devise an McMC sampler with good mixing properties
 Extends, for computational reasons, only tens of generations backwards in time
… and what we hope to get
 Obtain useful summary statistics
 E.g. estimates of IBD-probabilities between pairs of sampled individuals
 Use the algorithm to perform numerical integration over model unobservables
 E.g. in gene mapping, when combined with a phenotype model, to account for shared ancestry
The frame of study
 Assume that we have fixed
 A population whose size we know for T-1 (non-overlapping) generations backwards in time (T~10)
 N sampled individuals from the current generation
 Marker map with M markers and known recombination fractions
 Allele frequencies at the population level for each of the markers
A (prior) model for a possible history
 A configuration C consists of
 a pedigree
 allelic paths
 Specify probabilities for
 Pedigree graph, Pg(C)
 Recombination events, Pr(C)
 Founder alleles, Pa(C)
 The total probability for C is
P(C) = Pg(C) x Pr(C) x Pa(C)
A probability model for pedigrees
 For fixed
 number of generations, T-1, backwards in time
 population size in each generation (number of ♂ and ♀)
 sample of size N from the current generation
 mating parameters α and β
 To simulate a pedigree from the distribution we
 Proceed generation by generation from 0,…,T-1.
 Let children choose parents according to a Pólya urn scheme, where α affects the correlation of choices of fathers and β affects the correlation of choices of mothers given the choices of fathers.
 Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of
Sampled Individuals. Theor Pop Biol 67:75-83.
Children choosing fathers
 Suppose k children have chosen their fathers from among N_M males of the population
 Ch(m) is the number of children that have chosen male m
 P(child k+1 chooses father m) ∝ α + Ch(m)
 Small α implies dominant males
 Large α implies that the number of offspring does not vary much between different males
Children choosing mothers
 Suppose k children have chosen their mothers from among N_F females of the population
 Ch(m,f) is the number of children who have chosen male m and female f as their parents
 P(child k+1 chooses mother f | the father of child k+1 is m) ∝ β + Ch(m,f)
 Small β implies faithful males (monogamy in large populations)
 Large β implies random mating
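A minimal Python sketch of one generation of this urn scheme, assuming the two rules are normalised in the obvious way: P(father m) = (α + Ch(m)) / (α × number of males + k) and P(mother f | father m) = (β + Ch(m,f)) / (β × number of females + Ch(m)). The function name `choose_parents` and its arguments are hypothetical and not taken from the authors' software. The sketch ignores the sex of the children and does not iterate over generations.

```python
import random


def choose_parents(n_children, n_males, n_females, alpha, beta, rng):
    """Let children choose (father, mother) pairs one at a time.

    Assumes alpha > 0 and beta > 0. Returns a list of (m, f) index pairs.
    """
    father_counts = [0] * n_males  # Ch(m): children that have chosen male m
    pair_counts = {}               # Ch(m, f): children that have chosen pair (m, f)
    parents = []
    for _ in range(n_children):
        # Father: weight alpha + Ch(m); small alpha favours already popular males.
        w_father = [alpha + c for c in father_counts]
        m = rng.choices(range(n_males), weights=w_father)[0]
        # Mother given father m: weight beta + Ch(m, f); small beta favours
        # the females that m already has children with.
        w_mother = [beta + pair_counts.get((m, f), 0) for f in range(n_females)]
        f = rng.choices(range(n_females), weights=w_mother)[0]
        father_counts[m] += 1
        pair_counts[(m, f)] = pair_counts.get((m, f), 0) + 1
        parents.append((m, f))
    return parents


if __name__ == "__main__":
    rng = random.Random(2006)
    print(choose_parents(10, 5, 5, alpha=0.5, beta=0.01, rng=rng))
```

With a very small β, each father in the output should be paired with a single mother, which is the monogamy case described on the slide.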
Examples with different parameters
 Left: a few dominant males + monogamy
 Middle: a few dominant males
 Right: Random mating
Probability for allelic paths
 For each non-founder haplotype in the pedigree form the
expression
 Take the product of these over all haplotypes to obtain
Pr(C)
 Consider all founder alleles and take the product of the corresponding population allele frequencies to get Pa(C) (founders are assumed to be in Hardy-Weinberg and linkage equilibrium)
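The per-haplotype expression is not reproduced in this transcript. The sketch below is therefore an assumption and not the authors' formula: it uses the standard no-interference model, in which a transmitted haplotype starts on either parental haplotype with probability 1/2 and switches between adjacent markers with the recombination fraction of that interval. The names `transmission_prob` and `founder_allele_prob`, and the data layout, are hypothetical.

```python
def transmission_prob(origins, rec_fractions):
    """Probability of one transmitted haplotype under the assumed model.

    origins[j] is 0 or 1: which parental haplotype the allele at marker j
    was copied from. rec_fractions[j] is the recombination fraction between
    markers j and j+1.
    """
    if len(rec_fractions) != len(origins) - 1:
        raise ValueError("need one recombination fraction per marker interval")
    p = 0.5  # the first marker is copied from either parental haplotype
    for prev, cur, theta in zip(origins, origins[1:], rec_fractions):
        p *= theta if prev != cur else 1.0 - theta
    return p


def founder_allele_prob(founder_alleles, allele_freqs):
    """Product of population allele frequencies over all founder alleles.

    founder_alleles: list of (marker_index, allele) pairs.
    allele_freqs: one dict per marker, mapping allele -> population frequency.
    """
    p = 1.0
    for marker, allele in founder_alleles:
        p *= allele_freqs[marker][allele]
    return p


if __name__ == "__main__":
    # One recombination between the 2nd and 3rd of three markers.
    print(transmission_prob([0, 0, 1], [0.05, 0.05]))
    print(founder_allele_prob([(0, 1), (0, 2)], [{1: 0.3, 2: 0.7}]))
```

Under these assumptions, Pr(C) would be the product of `transmission_prob` over all non-founder haplotypes and Pa(C) the value of `founder_allele_prob`.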
Data
 Assume that we also have
 Genotype data of the sampled individuals on M markers
 The (posterior) probability in our model is
π(C) ∝ Pg(C) x Pr(C) x Pa(C) x I(C is consistent with the data)
 We are able to sample efficiently from the prior but not from
the posterior
Markov chain Monte Carlo sampling
 We generate a Markov chain whose state space consists of
all configurations consistent with the data and whose
stationary distribution is our posterior (Metropolis-Hastings
algorithm)
 Highly dependent variables (close relatives and linked
markers) require large block updates
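A toy Python sketch of a Metropolis-Hastings sampler for a target of the form prior weight × consistency indicator. It is an illustration only: the states are plain integers rather than pedigree configurations, the proposal is uniform over all states and therefore symmetric, and the name `metropolis_hastings` and its arguments are hypothetical. The block updates described above are not implemented.

```python
import random


def metropolis_hastings(weight, consistent, states, start, n_iter, rng):
    """Sample from pi(x) proportional to weight(x) * I(consistent(x)).

    The chain must start in a consistent state. Inconsistent proposals have
    target probability 0 and are always rejected, so the chain never leaves
    the set of consistent states.
    """
    if not consistent(start):
        raise ValueError("start state must be consistent with the data")
    x = start
    chain = []
    for _ in range(n_iter):
        y = rng.choice(states)  # symmetric proposal, so q cancels in the ratio
        ratio = weight(y) / weight(x) if consistent(y) else 0.0
        if rng.random() < min(1.0, ratio):
            x = y
        chain.append(x)
    return chain


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy target: weight x + 1 on 0..9, restricted to even states.
    chain = metropolis_hastings(lambda x: x + 1.0, lambda x: x % 2 == 0,
                                list(range(10)), 0, 20000, rng)
    print({s: round(chain.count(s) / len(chain), 3) for s in range(0, 10, 2)})
```

In the toy run the visit frequencies of the even states should approach 1/25, 3/25, 5/25, 7/25 and 9/25.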
Proposals
 Different versions of proposals
 A (randomly chosen) group of children chooses (possibly new) parents and transmits their alleles to these parents
 All children of a fixed father/mother choose (possibly new) mother/father and transmit their alleles to her/him
 One child at a time chooses parent(s) and transmits alleles
 All children within the group jointly choose new parents and transmit alleles
 Pedigree is not changed but new allele paths are proposed
Schematic representation of some updates in
the MCMC algorithm
Example 1:
Relatedness estimation with unlinked markers
 Simulated data
 20
generations ago a single founder population divided into 3
population isolates
 Our
sample contains 10 sibships of 3 individuals from each of
the 3 populations (i.e. 90 individuals altogether)
Relatedness matrix estimated from pedigrees
Qualitative reconstruction with dendrogram
Same data analyzed by STRUCTURE
[Figure panels: 3 pop, 10 pop, 30 pop]
Real data example: individuals sampled from
Eastern and Western Finland: 31 unlinked
microsatellite markers
Example 2:
The case of linked markers
 Simulated pedigree
 10 generations
 Youngest generation
 39 individuals divided into 13 nuclear families
 Genotype data
 20 markers / 10 alleles
 Recombination fraction 0.05
Reconstruction
 We gave the algorithm
 The genotype data on the youngest generation
 The (correct) marker map
 The (correct) allele frequencies
 The population structure
 The algorithm was run for 500,000 iterations
Reconstructing the pedigree
Reconstructing the haplotypes
 The accuracy of the haplotype reconstruction can be
measured with the concept of switch distance (SD)
 SD between two pairs of haplotypes is the number of phase
relations between neighboring loci that need to be changed
in order to turn the first pair of haplotypes to the other
 If correct haplotypes were (111111,222222) then
 (111222,222111)
has SD=1
 (112211,221122)
has SD=2
 (121212,212121)
has SD=5
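The switch distance can be sketched in a few lines of Python. The function name `switch_distance` is hypothetical. The sketch assumes that both haplotype pairs carry the same genotype at every locus, and it skips homozygous loci because they have no phase.

```python
def switch_distance(true_pair, est_pair):
    """Number of phase relations between neighbouring heterozygous loci
    that differ between two haplotype pairs of the same individual."""
    t1, t2 = true_pair
    e1, e2 = est_pair
    if not (len(t1) == len(t2) == len(e1) == len(e2)):
        raise ValueError("all haplotypes must have the same length")
    phases = []
    for a, b, c, d in zip(t1, t2, e1, e2):
        if sorted((a, b)) != sorted((c, d)):
            raise ValueError("the two pairs imply different genotypes")
        if a == b:
            continue  # homozygous locus: phase is undefined
        # True if the first estimated haplotype carries the allele of the
        # first true haplotype at this locus.
        phases.append(c == a)
    # A switch is needed wherever the orientation changes between
    # consecutive heterozygous loci.
    return sum(p != q for p, q in zip(phases, phases[1:]))


if __name__ == "__main__":
    truth = ("111111", "222222")
    for est in [("111222", "222111"), ("112211", "221122"), ("121212", "212121")]:
        print(est, switch_distance(truth, est))
```

The three calls reproduce the slide's examples (SD = 1, 2 and 5). Listing the two haplotypes of a pair in the opposite order does not change the distance.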
Reconstructing the haplotypes
 The SDs between the reconstructed and the true haplotype pairs of the
youngest generation (sum over all 39 individuals)
Reconstructing the IBD sharing
 We consider those alleles IBD (identical by descent) that
trace back to a common ancestral allele at the founder level
(9 generations backwards in time)
 It is possible to calculate a single quantity that measures
the proportion of the genome that two individuals share
(coefficient of relatedness r)
 It is also possible to compare the IBD sharing more
accurately along the chromosome
Comparison with IBS-based estimators
Distribution of L_2 errors (741 values, one for each pair among the 39 individuals)
IBS-based estimators: Lynch (1988), Lynch and Ritland (1999), Wang (2002)
Sums: 1.93, 3.25, 3.27, 3.51
Reconstructing IBD
Future work
 Possibility of fixing some parts of the pedigree
 Extending partially known genotype data to the known
pedigree
 Pirinen, Gasbarra (2006): Finding consistent gene transmission patterns on large and complex pedigrees. IEEE Trans. Comp. Biol. Bioinf. 3:252-262
Future work
 Adding a QTL or phenotype model to the algorithm
 Allowing for mutations and considering evolutionary time
scales (Ancestral Recombination Graph)
 Running many chains in parallel "at different temperatures"
 McMcMC (Metropolis-coupled McMC) with 20 processors achieved slightly better accuracy in 12 hours (of wall-clock time) than a single processor in 5 days
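A toy Python sketch of Metropolis-coupled McMC, given only to illustrate running chains at different temperatures. Chain k targets p(x)^β_k, with β = 1 for the cold chain, and a swap between chains i and j is accepted with probability min(1, exp((β_i − β_j)(log p(x_j) − log p(x_i)))). The names `mc3` and `swap_accept_prob`, the integer state space and the random-walk proposal are all assumptions. The chains run serially here, not on separate processors.

```python
import math
import random


def swap_accept_prob(log_p_i, log_p_j, beta_i, beta_j):
    """Acceptance probability for swapping the states of chains i and j."""
    log_ratio = (beta_i - beta_j) * (log_p_j - log_p_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mc3(log_p, n_states, betas, n_iter, rng):
    """Coupled chains on the states 0..n_states-1; betas[0] must be 1.0.

    Returns the states visited by the cold chain.
    """
    xs = [0] * len(betas)
    cold = []
    for _ in range(n_iter):
        # Within-chain step: random-walk Metropolis on the heated target.
        for k, beta in enumerate(betas):
            y = xs[k] + rng.choice((-1, 1))
            if 0 <= y < n_states:  # proposals outside the range are rejected
                log_ratio = beta * (log_p(y) - log_p(xs[k]))
                if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                    xs[k] = y
        # Between-chain step: try to swap a random pair of neighbouring chains.
        k = rng.randrange(len(betas) - 1)
        a = swap_accept_prob(log_p(xs[k]), log_p(xs[k + 1]), betas[k], betas[k + 1])
        if rng.random() < a:
            xs[k], xs[k + 1] = xs[k + 1], xs[k]
        cold.append(xs[0])
    return cold


if __name__ == "__main__":
    def log_p(x):
        # Two well-separated modes, at 3 and 12.
        return math.log(math.exp(-(x - 3) ** 2 / 2) + math.exp(-(x - 12) ** 2 / 2))

    cold = mc3(log_p, 16, [1.0, 0.5, 0.25, 0.1], 20000, random.Random(0))
    print(sum(x >= 8 for x in cold) / len(cold))
```

The heated chains cross the valley between the two modes easily, and swaps pass those moves down to the cold chain.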
Thanks
Matti Pirinen
Dario Gasbarra
Mikko Sillanpää