Reconstructing pedigrees and gene flows


4.1.2006
Reconstructing Genealogies: a Bayesian approach
Dario Gasbarra
Matti Pirinen
Mikko Sillanpää
Elja Arjas
Department of Mathematics and Statistics
We all are related … but to different degrees …
 Consider a population evolving in time
 Inverse problem:
   Suppose the current state of the process is known
    - individuals alive at the moment
   What was the path leading to this state?
    - family structures (pedigree)
    - inheritance patterns
Pedigrees
 Specify relationship categories
   Parent-offspring, full siblings / half siblings, first cousins etc.
 In graphs
   Circles for females, squares for males
   Black nodes represent nuclear families
   Time runs downwards
Gene flow
 Alleles (i.e. different variants of the same gene) flow through the pedigree
 Gene flow gives us a means to quantify the degree of relatedness between individuals
   How much of their genome do two individuals share?
   At what loci do they have identical alleles?
[Figure labels: chromosome, allele, DNA]
Gene flow
 Two alleles may be identical
   by state (IBS)
    - They have the same DNA sequence
   by descent (IBD)
    - They descend from the same ancestral allele within a given reference frame
Here the children share allele 1 IBS, but not IBD (w.r.t. their parents’ generation).
Meiosis
 When gametes are formed, the paternal and the maternal chromosomes (haplotypes) may cross over and recombine
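As a rough illustration of recombination during meiosis, the sketch below forms one gamete from a parent's two haplotypes. The function name `make_gamete`, the list-of-alleles representation and the per-interval recombination fractions `thetas` are assumptions made for this example. Recombination in adjacent intervals is treated as independent (no interference).

```python
import random

def make_gamete(paternal, maternal, thetas, rng):
    """Return (gamete, origins) for one simulated meiosis.

    paternal, maternal: allele lists of equal length (one entry per marker).
    thetas: recombination fractions between adjacent markers (markers - 1 values).
    origins[j] is 0 if marker j was copied from `paternal`, 1 if from `maternal`.
    """
    if len(paternal) != len(maternal) or len(thetas) != len(paternal) - 1:
        raise ValueError("inconsistent input lengths")
    strands = (paternal, maternal)
    current = rng.randrange(2)              # starting strand, probability 1/2 each
    gamete = [strands[current][0]]
    origins = [current]
    for j, theta in enumerate(thetas, start=1):
        if rng.random() < theta:            # recombination in interval (j-1, j)
            current = 1 - current
        gamete.append(strands[current][j])
        origins.append(current)
    return gamete, origins
```

With all recombination fractions equal to zero the gamete is a copy of one parental haplotype; larger values produce mosaics of the two.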
Haldane’s model of recombination
 Recombination fraction θ between two loci on the same chromosome is the proportion of meioses in which a recombination event (i.e., an odd number of crossovers) takes place between the loci
 Haldane’s model assumes that crossovers occur independently along each chromosome
   a Poisson process model follows
[Figure: chromosome annotated with recombination fractions 17%, 9% and 9.5%]
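Under a Poisson crossover process, the probability of an odd number of crossovers over a map distance d (in Morgans) is θ = (1 − e^(−2d)) / 2, which is Haldane's map function. The sketch below implements it together with its inverse and the rule for combining two adjacent intervals. The function names are chosen for this example only.

```python
import math

def haldane_theta(d_morgans):
    """Recombination fraction for map distance d (in Morgans) under Haldane's model."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def haldane_distance(theta):
    """Inverse map: distance in Morgans for a recombination fraction theta < 0.5."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)

def combine_thetas(theta_ab, theta_bc):
    """Recombination fraction between the outer loci a and c when the two
    intervals a-b and b-c recombine independently (no interference)."""
    return theta_ab + theta_bc - 2.0 * theta_ab * theta_bc
```

For instance, `haldane_theta(0.053)` is about 0.050, consistent with the 5.3 cM marker spacing quoted later in the talk, and `combine_thetas(0.09, 0.095)` is about 0.168. That would agree with the 17% in the figure if the 9% and 9.5% values are read as two adjacent intervals, which is an interpretation rather than something the slide states.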
The frame for study
 From now on we assume that we have fixed
   A population whose size we know for T-1 (non-overlapping) generations backwards in time
   N sampled individuals from the current generation
   Marker map with M markers and known recombination fractions
   Allele frequencies at the population level for each of the markers
A (prior) model for a possible history
 A configuration C consists of
   a pedigree
   allelic paths
 Specify probabilities for
   Pedigree graph, Pg(C)
   Recombination events, Pr(C)
   Founder alleles, Pa(C)
 The total probability for C is
P(C) = Pg(C) x Pr(C) x Pa(C)
A probability model for pedigrees
 For fixed
   number of generations, T-1, backwards in time
   population size in each generation (number of ♂ and ♀)
   sample of size N from the current generation
   mating parameters α and β
 To simulate a pedigree from the distribution
   Proceed generation by generation from 0,…,T-1.
   Let children choose parents according to a Pólya urn scheme, where α affects the correlation of choices of fathers and β affects the correlation of choices of mothers given the choices of fathers.
 Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.
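The exact urn weights are given in the cited paper and are not shown on the slide. The sketch below therefore assumes the textbook Pólya urn, in which a candidate parent already chosen by c earlier children is picked with probability proportional to α + c. It covers only the choice of one parent (for example, fathers). The function name and this weighting rule are assumptions for illustration, not the authors' specification.

```python
import random

def polya_urn_choices(n_children, n_candidates, alpha, rng):
    """Sequentially assign each child a parent index in range(n_candidates).

    Assumed rule: a candidate already chosen c times is picked with
    probability proportional to alpha + c (alpha > 0).
    """
    if alpha <= 0 or n_candidates < 1:
        raise ValueError("need alpha > 0 and at least one candidate")
    counts = [0] * n_candidates
    choices = []
    for _ in range(n_children):
        weights = [alpha + c for c in counts]
        parent = rng.choices(range(n_candidates), weights=weights, k=1)[0]
        counts[parent] += 1                 # reinforcement: popular parents get more likely
        choices.append(parent)
    return choices
```

Under this assumed rule a small α concentrates the children on a few parents, while a large α approaches uniform random choice.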
Examples with different parameters
 Left: a few dominant males + monogamy
 Middle: a few dominant males
 Right: Random mating
Probability for allelic paths
 For each non-founder haplotype in the pedigree form the expression [expression not preserved]
 Take the product of these over all haplotypes to obtain Pr(C)
 Consider all founder alleles and take the product of the corresponding population allele frequencies to get Pa(C)
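The per-haplotype expression did not survive in the source. One standard form under Haldane's model, assumed here rather than taken from the slides, gives each transmitted haplotype the probability 1/2 × ∏ θ_j^(r_j) (1 − θ_j)^(1 − r_j), where r_j = 1 if the parental strand of origin switches between markers j and j+1. The sketch below evaluates this in log space and also the founder-allele term. All names and data layouts are hypothetical.

```python
import math

def log_prob_haplotype(origins, thetas):
    """Log-probability of one transmitted haplotype (assumed form, see lead-in).

    origins[j] in {0, 1}: parental strand that marker j was copied from.
    thetas[j]: recombination fraction between markers j and j + 1.
    """
    if len(thetas) != len(origins) - 1:
        raise ValueError("need one theta per adjacent marker pair")
    logp = math.log(0.5)                    # choice of the starting strand
    for j, theta in enumerate(thetas):
        switched = origins[j] != origins[j + 1]
        p = theta if switched else 1.0 - theta
        if p == 0.0:
            return -math.inf                # impossible path
        logp += math.log(p)
    return logp

def log_prob_founder_alleles(founder_alleles, freqs):
    """Sum of log population frequencies over founder alleles.

    founder_alleles: list of (marker, allele) pairs.
    freqs[marker][allele]: population frequency of that allele.
    """
    return sum(math.log(freqs[m][a]) for m, a in founder_alleles)
```

Summing `log_prob_haplotype` over all non-founder haplotypes would give log Pr(C), and `log_prob_founder_alleles` gives log Pa(C).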
Data
 Assume that we also have
   Genotype data of the sampled individuals on M markers
 The (posterior) probability in our model is
π(C) ∝ Pg(C) x Pr(C) x Pa(C) x 1(C consistent with the data)
 We are able to sample efficiently from the prior but not from the posterior
Markov chain Monte Carlo sampling
 We generate a Markov chain whose state space consists of all configurations consistent with the data and whose stationary distribution is our posterior (Metropolis-Hastings algorithm)
 If this chain is irreducible then the expected values of functions defined on the space of configurations can be approximated with sample averages
   haplotype configurations
   IBD-sharing between individuals
Metropolis-Hastings algorithm
 The M-H algorithm produces a chain of configurations, where at each step of the chain a new value is proposed (from some proposal distribution) and this value is either accepted (the chain moves to the proposed value) or rejected (the chain stays at its current value), according to an acceptance rule that depends on the posterior and the proposal distribution.
 Good proposals that will be accepted quite often are needed so that the chain moves around within a reasonable amount of time.
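The accept/reject step can be sketched on a toy problem. The example below is a generic Metropolis-Hastings sampler with a symmetric proposal, run on six states arranged in a ring. It is not the pedigree sampler itself: the state space, `WEIGHTS`, `toy_weight` and `toy_propose` are all invented for illustration. State 3 has weight zero, standing in for a configuration that is inconsistent with the data.

```python
import random

def metropolis_hastings(weight, propose, start, n_steps, rng):
    """Generic M-H sampler, assuming a symmetric proposal distribution.

    weight(x): unnormalised target probability (0 for forbidden states).
    propose(x, rng): draws a candidate state; symmetry means q cancels in the ratio.
    """
    if weight(start) <= 0:
        raise ValueError("start state must have positive weight")
    chain = [start]
    current = start
    for _ in range(n_steps):
        candidate = propose(current, rng)
        ratio = weight(candidate) / weight(current)
        if rng.random() < min(1.0, ratio):
            current = candidate             # accept the proposal
        chain.append(current)               # on rejection the old state is repeated
    return chain

# Toy target on a ring of six states; state 3 plays the role of an
# inconsistent configuration and is never visited.
WEIGHTS = [1.0, 2.0, 3.0, 0.0, 2.0, 2.0]

def toy_weight(x):
    return WEIGHTS[x]

def toy_propose(x, rng):
    return (x + rng.choice((-1, 1))) % len(WEIGHTS)
```

Running `metropolis_hastings(toy_weight, toy_propose, 0, 50000, random.Random(0))` and counting visits gives frequencies close to the normalised weights (0.1, 0.2, 0.3, 0, 0.2, 0.2), which is the sense in which sample averages approximate posterior expectations.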
Proposals
 Highly dependent variables (close relatives and linked markers) require large block updates
 Different versions of proposals
   A (randomly chosen) group of children chooses (possibly new) parents transmitting their alleles to these parents
   All children of a particular father/mother choose a (possibly new) mother/father transmitting their alleles to her/him
   One child at a time chooses new parent(s) transmitting alleles to them
   All children within the group jointly choose new parents and transmit alleles
   Pedigree is not changed but new allele paths are proposed
An example
 Simulated pedigree
   10 generations
 Youngest generation
   39 individuals divided into 13 nuclear families
 Population
   200 founders
   growing exponentially by 1.2
Example continues…
 Simulated gene flow on the pedigree
 20
markers
 10
equally frequent alleles at each locus in the founder
generation
 Haldane’s
 Spacing
model of recombination (no interference)
between adjacent markers 5.3 cM (i.e. recombination
fraction 0.05)
Reconstruction
 We gave the algorithm
   The genotype data on the youngest generation
   The (correct) marker map
   The (correct) allele frequencies
   The population structure
 The algorithm was run for 500,000 iterations
Reconstructing the pedigree
Reconstructing the haplotypes
 Each individual (in diploid species) carries two copies of each chromosome
   One is inherited from the father (mother) and is called a paternal (maternal) haplotype
   Genotyping does not (usually) determine which multilocus allelic combination is inherited from the same parent
    - from lab {1,2}x{4,3}
    - true haplotypes may be either (13,24) or (14,23)
 There exist two kinds of haplotyping methods
   Pedigree-based (SimWalk2, Merlin, Genehunter)
   Population-based (PHASE, HAPLOFREQ)
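The phase ambiguity in the {1,2}x{4,3} example can be made concrete by enumerating every haplotype pair that is compatible with a set of unphased genotypes. This is a brute-force sketch for small inputs, not one of the haplotyping methods listed above. The function name is chosen for this example.

```python
from itertools import product

def possible_haplotype_pairs(genotypes):
    """List the distinct unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one (allele, allele) tuple per marker, e.g. [(1, 2), (4, 3)].
    """
    pairs = set()
    for flips in product((False, True), repeat=len(genotypes)):
        # At each marker, either keep or swap which allele goes to which haplotype.
        hap_a = tuple(g[1] if f else g[0] for g, f in zip(genotypes, flips))
        hap_b = tuple(g[0] if f else g[1] for g, f in zip(genotypes, flips))
        pairs.add(tuple(sorted((hap_a, hap_b))))   # unordered pair
    return sorted(pairs)
```

For `[(1, 2), (4, 3)]` this returns the two candidates from the slide, (13,24) and (14,23). With h heterozygous markers there are 2^(h-1) candidates, which is why enumeration quickly becomes infeasible.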
Reconstructing the haplotypes
 The accuracy of the haplotype reconstruction can be measured with the concept of switch distance (SD)
 SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes into the other
 If the correct haplotypes were (111111,222222) then
   (111222,222111) has SD=1
   (112211,221122) has SD=2
   (121212,212121) has SD=5
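The switch distance can be computed by recording, at each heterozygous locus, whether the reconstructed first haplotype carries the same allele as the true first haplotype, and then counting how often that indicator changes between neighboring loci. The sketch below represents haplotypes as strings, one character per locus, which is an assumption of this example.

```python
def switch_distance(true_pair, est_pair):
    """Number of phase switches needed to turn est_pair into true_pair.

    Each pair is (hap1, hap2), with haplotypes given as equal-length strings.
    """
    t1, t2 = true_pair
    e1, e2 = est_pair
    if not (len(t1) == len(t2) == len(e1) == len(e2)):
        raise ValueError("haplotypes must have equal length")
    phase = []
    for a, b, x, y in zip(t1, t2, e1, e2):
        if sorted((a, b)) != sorted((x, y)):
            raise ValueError("pairs imply different genotypes")
        if a != b:                          # phase is only defined at heterozygous loci
            phase.append(x == a)
    # Count changes of the phase indicator between consecutive heterozygous loci.
    return sum(p != q for p, q in zip(phase, phase[1:]))
```

Applied to the three examples on the slide, with `("111111", "222222")` as the true pair, the function gives 1, 2 and 5.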
Reconstructing the haplotypes
 The SDs between the reconstructed and the true haplotype pairs of the youngest generation (sum over all 39 individuals)
Reconstructing the IBD sharing
 We consider those alleles IBD (identical by descent) that trace back to a common ancestral allele at the founder level (9 generations backwards in time)
 It is possible to calculate a single quantity that measures the proportion of the genome that two individuals share (coefficient of relatedness r)
 It is also possible to compare the IBD sharing more accurately along the chromosome
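If every allele in a configuration is labelled with the founder allele it descends from, a relatedness coefficient can be read off directly. The sketch below uses one common convention, assumed here because the slides do not give a formula: at each locus the kinship is the fraction of the four cross-individual allele pairs that share a founder label, and r is twice the average kinship over loci. The helper for the sum of squared errors mirrors the comparison against the true values. All names are hypothetical.

```python
from itertools import combinations

def relatedness(labels_a, labels_b):
    """Relatedness from founder-allele labels (assumed convention, see lead-in).

    labels_x: list over loci of (paternal_label, maternal_label) founder ids.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need the same positive number of loci for both individuals")
    total = 0.0
    for (a1, a2), (b1, b2) in zip(labels_a, labels_b):
        matches = (a1 == b1) + (a1 == b2) + (a2 == b1) + (a2 == b2)
        total += matches / 4.0              # per-locus kinship
    return 2.0 * total / len(labels_a)

def sum_squared_errors(estimated, true):
    """Sum of squared differences between estimated and true coefficients."""
    return sum((e - t) ** 2 for e, t in zip(estimated, true))

def all_pairs(n):
    """All unordered pairs of individuals 0..n-1."""
    return list(combinations(range(n), 2))
```

A parent-offspring pair that shares exactly one founder allele at every locus gets r = 0.5 under this convention, and `all_pairs(39)` has 741 entries, one per pair in the youngest generation.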
Reconstructing IBD
 The reconstructed relatedness coefficients of each of the 741 pairs of the individuals belonging to the youngest generation were compared with the true values (sum of squared errors shown)
Comparison with IBS-based estimators
Distribution of L_2 errors (741 values)
Sums: 1.93, 3.25, 3.27, 3.51
Reconstructing IBD
Another example of pedigree reconstruction
 Population with 200 individuals, 50 markers / 9 alleles
Future work
 Possibility of fixing some parts of the pedigree
 Extending partially known genotype data to the known pedigree in accordance with the Mendelian rules of inheritance is in general an NP-complete problem
[Figure: example pedigree with partially known genotypes over alleles a to f]
Future work with the reconstruction algorithm
 Adding a QTL (quantitative trait locus) model to the algorithm
   Does phenotype correlate with IBD-sharing at some chromosomal region(s)?
 Running many chains in parallel “at different temperatures”
Thanks
Dario Gasbarra, Mikko Sillanpää and Matti Pirinen