Genetic mapping

EXTREME VALUES, COPULAS AND
GENETIC MAPPING
Bojan Basrak
Department of Mathematics,
University of Zagreb, Croatia
EVA 2005, Gothenburg
Genetic mapping
• A genetic map gives the relative positions of genes on the
chromosomes, with distances between them typically measured
in centimorgans (cM).
• Linkage analysis aims to find the approximate location of genes
associated with certain traits in plants and animals.
• It is a statistical method that compares genetic similarity between
two individuals (at a marker) to similarity of their physical or
psychological traits (phenotype).
• Among the most studied traits are inheritable diseases.
QTL
• Quantitative trait: A measurable trait that shows
continuous variation, e.g. skin pigmentation, height,
cholesterol, etc.
• Quantitative traits are normally influenced by several
genes and the environment.
• QTL or quantitative trait locus: a locus (or a gene)
affecting a quantitative trait.
• There is even The Journal of Quantitative Trait Loci.
• Genetic similarity between two individuals at a given
locus is typically measured by a number called identity
by descent (IBD) status.
• Two genes of two different people are IBD if one is a
physical copy of the other, or if they are both copies of
the same ancestral gene.
• For any two people, IBD status is a number in the set
{0,1,2}. In real life, this number typically needs to be
estimated.
• Linkage analysis is very effective with Mendelian
inheritance.
• Mapping genes involved in inheritable diseases can be
done by comparing IBD status of affected relatives (e.g.
breast cancer)
• Mapping QTLs in animals or plants is performed by
arranging a cross between two inbred strains, which are
substantially different in a quantitative trait (e.g. tomato
fruit mass or pH).
IBD status of two half sibs
[Figure: the mother's chromosomes and the chromosomes of two half sibs (Sib 1, Sib 2) after two meioses and some other developments. X(t) = number of alleles identical by descent at position t, with distance measured in Morgans; in the example shown, X(t)=0 and X(s)=1.]
• Recombinations, or more specifically, locations of
crossovers in meiosis are frequently modelled by a
stochastic process (standard choice is the Poisson
process, suggested by Haldane in 1919.)
• The process (X(t)) is an ON-OFF process in the case of
half-sibs, or sum of two independent such processes in
the case of siblings.
• In particular, under Poisson process model, (X(t)) is a
stationary Markov process. Moreover, X(t) is Bernoulli
distributed for each t in the case of half sibs.
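The ON-OFF behaviour for half sibs can be sketched by simulation. This is a minimal illustration, assuming the Haldane model: crossovers in each of the two meioses form independent Poisson processes with rate 1 per Morgan, so X(t) flips at rate 2. The function names and the chromosome length of 1.4 Morgans are hypothetical choices for the example.

```python
import random

def simulate_half_sib_ibd(length_morgans, rng):
    """Simulate X(t) on [0, length] for two half sibs under the Haldane model.

    Returns (x0, switch_points): the IBD status at t = 0 and the sorted
    positions (in Morgans) where X(t) flips between 0 and 1.
    """
    # At a fixed locus the sibs share the maternal allele with probability 1/2.
    x0 = 1 if rng.random() < 0.5 else 0
    # Superposing two independent rate-1 crossover processes gives rate 2.
    switch_points = []
    position = rng.expovariate(2.0)
    while position < length_morgans:
        switch_points.append(position)
        position += rng.expovariate(2.0)
    return x0, switch_points

def ibd_at(x0, switch_points, t):
    """Evaluate X(t): flip x0 once for every switch point at or before t."""
    flips = sum(1 for s in switch_points if s <= t)
    return x0 ^ (flips % 2)

rng = random.Random(2005)
n_pairs = 20000
distance = 0.10  # 10 cM, expressed in Morgans
unchanged = 0
for _ in range(n_pairs):
    x0, switches = simulate_half_sib_ibd(1.4, rng)
    if ibd_at(x0, switches, distance) == x0:
        unchanged += 1
fraction_unchanged = unchanged / n_pairs
print(fraction_unchanged)
```

Under these assumptions the fraction of pairs whose IBD status is the same at two loci 10 cM apart should be close to (1 + e^{-0.4})/2.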
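• In the Haldane model, for half sibs we have
$P(X(t+d) = X(t)) = \theta^2 + (1-\theta)^2$,
where
$\theta = \tfrac{1}{2}\,(1 - e^{-2d})$
is the recombination probability between two loci at distance d (in Morgans).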
• For simplicity, we assume that IBD status is known at
each marker (i.e. markers are completely genetically
informative).
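As a small worked example, the Haldane map function converts a genetic distance into a recombination probability. The sketch below assumes distances in Morgans; the second function is a derived quantity for half sibs (the IBD status is unchanged when neither or both of the two meioses recombine) and is an illustration rather than a formula taken from the slides. Both function names are hypothetical.

```python
import math

def haldane_theta(d_morgans):
    """Recombination probability between loci d Morgans apart (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def half_sib_ibd_unchanged(d_morgans):
    """P(X(t + d) = X(t)) for half sibs: neither or both meioses recombine."""
    theta = haldane_theta(d_morgans)
    return theta ** 2 + (1.0 - theta) ** 2

for cm in (1, 10, 50):
    d = cm / 100.0  # centimorgans to Morgans
    print(cm, haldane_theta(d), half_sib_ibd_unchanged(d))
```

For short distances the recombination probability is nearly equal to the distance in Morgans, and it approaches 1/2 for distant loci.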
• The human genome consists of over 3 × 10^9 base pairs (in
two copies) on 23 chromosomes. The average length of
a chromosome is 140 cM.
• Total length of the female (autosomal) genome is 4296 cM.
• Total length of the male genome is 2851 cM.
• That is: there is 1 expected crossover per 105 Mb in
males and per about 70 Mb in females (about 88 Mb on average). Thus, on the human
genome, 1 cM approximately equals 1 Mb.
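The conversion between genetic and physical distance is simple arithmetic, sketched here. It assumes a physical length of 3000 Mb and uses the two map lengths quoted above; the names are hypothetical. With these inputs the male figure comes out near 105 Mb per expected crossover and the female figure near 70 Mb, with an average of the two close to 88 Mb.

```python
GENOME_MB = 3000.0  # roughly 3 x 10^9 base pairs (assumed round figure)
MAP_LENGTH_CM = {"male": 2851.0, "female": 4296.0}

def mb_per_crossover(genome_mb, map_length_cm):
    """Physical distance (Mb) carrying one expected crossover (1 Morgan)."""
    return genome_mb / (map_length_cm / 100.0)

mb_per_morgan = {
    sex: mb_per_crossover(GENOME_MB, cm) for sex, cm in MAP_LENGTH_CM.items()
}
# One Morgan is 100 cM, so divide by 100 to get Mb per cM.
mb_per_cm = {sex: value / 100.0 for sex, value in mb_per_morgan.items()}
print(mb_per_morgan, mb_per_cm)
```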
Data
• From n sib-pairs we observe
- a sequence of iid phenotype pairs $(Y_1^i, Y_2^i)$, $i = 1, \dots, n$, with continuous
marginal distribution
and
- a sequence of iid IBD processes $(X^i(t))$, $i = 1, \dots, n$.
[Figure: two panels, labelled "IBD 1 at t" and "IBD 0 at t".]
Haseman-Elston
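• In 1972, Haseman and Elston suggested testing whether there is a linear
regression with negative slope between the squared phenotype difference $(Y_1 - Y_2)^2$ and the IBD status $X(t)$ at a marker t.
• Soon, this became the standard tool for mapping of
QTLs in human genetics.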
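The Haseman-Elston idea can be sketched on simulated data: regress the squared phenotype difference of each sib-pair on its IBD status at a marker and look at the sign of the slope. Everything below is an assumption made for illustration: the data are simulated, the correlation is taken to be 0.2 + 0.3(x - 1) given IBD status x, and the function names are hypothetical.

```python
import random

def simulate_sib_pair(rng, base_corr=0.2, ibd_effect=0.3):
    """Draw (x, y1, y2): IBD status and a standard bivariate normal pair."""
    # Sib-pairs share 0, 1 or 2 alleles IBD with probabilities 1/4, 1/2, 1/4.
    x = rng.choices([0, 1, 2], weights=[1, 2, 1])[0]
    rho = base_corr + ibd_effect * (x - 1)
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    y1 = z1
    y2 = rho * z1 + (1.0 - rho * rho) ** 0.5 * z2
    return x, y1, y2

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1972)
data = [simulate_sib_pair(rng) for _ in range(5000)]
ibd = [x for x, _, _ in data]
sq_diff = [(y1 - y2) ** 2 for _, y1, y2 in data]
slope = ols_slope(ibd, sq_diff)
print(slope)
```

Since E[(Y_1 - Y_2)^2 | x] = 2(1 - rho_x) for unit variances, the slope in this simulated setting should be near -0.6; a clearly negative slope is the signal for linkage.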
Variance Components Model
• The variance components model (Fulker and Cherny)
essentially assumes that the joint distribution of the
phenotypes $(Y_1, Y_2)$ is
• bivariate normal, conditionally on the IBD status x,
with the same marginal distributions,
• and a correlation $\rho_x$ that depends on x.
Linkage Analysis
• The main question:
– Does higher IBD status mean stronger dependence
between the two trait values?
In the variance components model this translates into
the test of
$H_0$: the correlation does not depend on the IBD status x
against
$H_A$: the correlation increases with x.
Test statistic
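• The statistical test is based on the log-likelihood ratio
statistic
• or (asymptotically equivalently) on the efficient score statistic $Z(t)$: the sum over the n sib-pairs of the score function at marker t, evaluated under the null hypothesis and divided by $\sqrt{n I}$,
• where $I$ is the appropriate entry of the Fisher information matrix, which
needs to be estimated in practice.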
[Figure: the statistic Z(t) plotted against position t, with its maximum (max) marked.]
Significance in genome-wide scans
• If we have more than one marker we need to deal with the issue
of multiple testing. The solution of this problem depends on the
intermarker spacings and the sample size.
• One could use permutation tests or other simulation based
methods to obtain p-values.
• If the sample size is large, one can apply a nice asymptotic
theory that determines significance thresholds from the analysis
of extremes of certain Gaussian processes (see Lander and
Botstein; Siegmund et al.).
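The permutation approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions: each pair has a hypothetical scalar "similarity score" (for instance minus the squared phenotype difference), the scan statistic is the largest marker-wise correlation between that score and IBD status, and markers are simulated as independent, which real linked markers are not. All names are hypothetical.

```python
import random

def max_marker_correlation(ibd, scores):
    """Largest sample correlation between the score and IBD status over markers."""
    n = len(scores)
    mean_s = sum(scores) / n
    ss = sum((s - mean_s) ** 2 for s in scores)
    best = float("-inf")
    for m in range(len(ibd[0])):
        column = [row[m] for row in ibd]
        mean_x = sum(column) / n
        sxx = sum((x - mean_x) ** 2 for x in column)
        sxy = sum((x - mean_x) * (s - mean_s) for x, s in zip(column, scores))
        best = max(best, sxy / (sxx * ss) ** 0.5)
    return best

def permutation_p_value(ibd, scores, n_perm, rng):
    """Genome-wide p-value: shuffle scores across pairs, recompute the maximum."""
    observed = max_marker_correlation(ibd, scores)
    shuffled = list(scores)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_marker_correlation(ibd, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)

rng = random.Random(18)
n_pairs, n_markers = 300, 5
ibd = [
    [rng.choices([0, 1, 2], weights=[1, 2, 1])[0] for _ in range(n_markers)]
    for _ in range(n_pairs)
]
# Simulated scores linked to marker index 2 only (assumed effect size).
scores = [(row[2] - 1) + rng.gauss(0.0, 1.0) for row in ibd]
p_value = permutation_p_value(ibd, scores, n_perm=200, rng=rng)
print(p_value)
```

Because the maximum over markers is recomputed for every permutation, the resulting p-value accounts for multiple testing across the scan without any distributional approximation.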
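• For an illustration, we assume that the markers are
“dense”, that is, IBD status is measured continuously
along the genome. It turns out that under our
assumptions and the null hypothesis one can show that
the test statistic $Z(t)$ converges in distribution to $Z_\infty(t)$ as $n \to \infty$,
where $(Z_\infty(t))$ is an Ornstein-Uhlenbeck process with
mean zero and covariance function
$\operatorname{Cov}(Z_\infty(s), Z_\infty(t)) = e^{-\beta |t - s|}$, for a rate $\beta > 0$,
over each chromosome.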
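• Now, approximate thresholds for a given significance
level can be obtained by studying extremes of the Ornstein-Uhlenbeck process (cf. Leadbetter et al.) over a finite
interval. Hence, for C chromosomes of total length G, we get approximately
$P(\max_t Z(t) > b) \approx 1 - \exp\{-[\,C\,(1 - \Phi(b)) + \beta\, G\, b\, \varphi(b)\,]\}$,
where $e^{-\beta |t - s|}$ is the covariance function of the process and $\Phi$, $\varphi$ are the standard normal distribution function and density.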
• For 23 human chromosomes with average length of 140
cM and significance level 0.05 we get threshold
b=4.08 (3.62 on LOD scale).
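The quoted threshold can be approximately reproduced numerically. The sketch below rests on assumptions that are not spelled out in the transcript: a covariance function exp(-beta |t - s|) with beta = 0.04 per cM (a value commonly used for sib-pairs), and the approximation P(max Z > b) ≈ 1 - exp{-[C(1 - Φ(b)) + beta G b φ(b)]} for C chromosomes of total length G. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def genome_wide_p_value(b, n_chrom, total_length_cm, beta_per_cm):
    """Approximate P(max Z(t) > b) for an OU process on n_chrom chromosomes."""
    tail = 1.0 - STD_NORMAL.cdf(b)
    expected_exceedances = (
        n_chrom * tail + beta_per_cm * total_length_cm * b * STD_NORMAL.pdf(b)
    )
    return 1.0 - math.exp(-expected_exceedances)

def threshold(alpha, n_chrom, total_length_cm, beta_per_cm):
    """Solve genome_wide_p_value(b) = alpha for b by bisection on [1, 10]."""
    lo, hi = 1.0, 10.0  # the p-value is decreasing in b on this interval
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if genome_wide_p_value(mid, n_chrom, total_length_cm, beta_per_cm) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b = threshold(0.05, n_chrom=23, total_length_cm=23 * 140.0, beta_per_cm=0.04)
lod = b ** 2 / (2.0 * math.log(10.0))  # convert a normal deviate to the LOD scale
print(b, lod)
```

With these assumed inputs the result lands close to the values quoted above, which suggests the assumptions are at least compatible with the calculation behind the slide.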
Disadvantages
• Normality assumption is frequently questionable
• Correlation can be a very bad measure of dependence if
this assumption does not hold
Risch and Zhang (1995) observe:
"The majority of such pairs provide little power to detect
linkage; only pairs that are concordant for high values,
low values, or extremely discordant pairs (for example,
one in the top 10 percent and other in the bottom 10
percent of the distribution) provide substantial power"
Copula
• The copula of a random pair $(Y_1, Y_2)$ is the distribution function C
of the random vector $(F_1(Y_1), F_2(Y_2))$,
where we assume that the marginal distributions $F_1$ and $F_2$ of
$Y_1$ and $Y_2$ are invertible. Hence the marginal distributions of the
copula are both uniform on [0,1].
• It is well known that the distribution of a random pair splits
into two marginal distributions and the copula. Also, the copula is
invariant under continuous increasing transformations.
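The invariance of the copula under increasing transformations can be checked empirically. In this sketch, which is an illustration on simulated data, the unknown marginals are replaced by rank-based "pseudo-observations" rank/(n + 1); the helper name and the particular transformations (exp and cube) are hypothetical choices.

```python
import math
import random

def pseudo_observations(values):
    """Map a sample to (0, 1) via rank / (n + 1), an empirical stand-in for F(Y)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

rng = random.Random(1)
y1 = [rng.gauss(0.0, 1.0) for _ in range(500)]
y2 = [0.6 * a + 0.8 * rng.gauss(0.0, 1.0) for a in y1]
u_before = list(zip(pseudo_observations(y1), pseudo_observations(y2)))

# Apply a different continuous increasing transformation to each coordinate.
t1 = [math.exp(a) for a in y1]
t2 = [b ** 3 for b in y2]
u_after = list(zip(pseudo_observations(t1), pseudo_observations(t2)))

print(u_before == u_after)
```

The marginal distributions change completely under these transformations, but the ranks, and hence the empirical copula, do not.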
Linkage analysis rephrased
• The main question:
– Does higher IBD status mean stronger dependence
between the two trait values?
could be rephrased as
– Does higher IBD status mean that the two trait
values have “more diagonalized” copula?
Note: marginal distributions do not change with IBD status.
Normal Copula
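• The normal copula is the copula of a normally distributed
random vector. Thus, if $(Y_1, Y_2)$ is bivariate normal with marginal distribution functions $F_1$, $F_2$ and correlation $\rho$,
then the random vector $(F_1(Y_1), F_2(Y_2))$ has the bivariate
normal copula.
Since it depends only on $\rho$, we denote it by $C_\rho$.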
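A pair with a normal copula but non-normal marginals can be sampled in two steps: draw a correlated normal pair, push it through the standard normal distribution function to get uniforms, then apply any quantile function. The sketch below is an illustration; the exponential marginal, the value rho = 0.5 and the function names are hypothetical choices.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sample_normal_copula(rho, rng):
    """Draw (U, V) with uniform marginals and normal copula with parameter rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)

def exponential_quantile(u, rate=1.0):
    """Inverse distribution function of the exponential distribution."""
    return -math.log(1.0 - u) / rate

rng = random.Random(7)
uniform_pairs = [sample_normal_copula(0.5, rng) for _ in range(10000)]
# Same (arbitrary) marginal for both coordinates, dependence from the copula.
phenotypes = [
    (exponential_quantile(u), exponential_quantile(v)) for u, v in uniform_pairs
]
mean_u = sum(u for u, _ in uniform_pairs) / len(uniform_pairs)
mean_y = sum(y for y, _ in phenotypes) / len(phenotypes)
print(mean_u, mean_y)
```

The marginals are exponential here, yet the dependence structure is exactly that of a bivariate normal vector with correlation 0.5.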
Bivariate Normal Copula
New Model
• Assume that the pair $(Y_1, Y_2)$ has
• the same copula as in the variance components
model, i.e. the normal copula $C_{\rho_x}$ with the correlation $\rho_x$ of that model,
conditionally on the IBD status x
• and the same (but arbitrary) continuous marginal
distribution, i.e. $F_1 = F_2$.
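• The model is not so new after all: equivalently, there is
an increasing function h such that $(h(Y_1), h(Y_2))$
satisfies the assumptions of the v.c. model.
• Suppose that $h(Y_1)$ has the standard normal
distribution function $\Phi$; then $F_1(y) = \Phi(h(y))$.
That is, $h = \Phi^{-1} \circ F_1$.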
We can proceed in two ways:
a) we could guess (estimate) h, or
b) we could guess (estimate) $F_1$.
The first method is already frequently applied in practice,
while the second one is easier to justify using the empirical
distribution function of the phenotypes.
To estimate $F_1$ we may use data from a larger sample if
available.
Transformation
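• In practice we might have only the 2n phenotypes of the n sib-pairs to estimate
the marginal distribution. So we could use the rescaled empirical distribution function
$\hat F(y) = \frac{1}{2n+1} \sum_{i=1}^{n} \sum_{j=1}^{2} \mathbf{1}\{Y_j^i \le y\}$.
• Transformed phenotypes are
$\hat Y_j^i = \Phi^{-1}(\hat F(Y_j^i))$ for member j = 1, 2 of pair i, where $\Phi$ is the standard normal distribution function.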
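The transformation step can be sketched as a normal-scores computation: pool the 2n phenotypes, rank them, and map rank/(2n + 1) through the inverse standard normal distribution function. This is a minimal illustration under assumptions; in particular the rescaling by 2n + 1, the simulated lognormal-type phenotypes and all names are hypothetical, not taken from the slides.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(pairs):
    """Replace each phenotype by Phi^{-1}(rank / (2n + 1)) in the pooled sample."""
    pooled = [y for pair in pairs for y in pair]
    total = len(pooled)
    order = sorted(range(total), key=lambda i: pooled[i])
    scores = [0.0] * total
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (total + 1))
    return [(scores[2 * k], scores[2 * k + 1]) for k in range(len(pairs))]

def correlation(pairs):
    """Pearson correlation between the two coordinates of the pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    return s12 / math.sqrt(s11 * s22)

rng = random.Random(1997)
rho = 0.5
pairs = []
for _ in range(4000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    # Heavy-tailed marginals, normal copula with parameter rho.
    pairs.append((math.exp(2.0 * z1), math.exp(2.0 * z2)))

r_scores = correlation(normal_scores(pairs))
r_raw = correlation(pairs)
print(r_scores, r_raw)
```

In this simulated setting the normal-scores correlation should sit near the copula parameter 0.5, while the raw Pearson correlation is distorted by the heavy tails.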
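• If [condition not recoverable from the transcript], one can show the following
Theorem
[statement not recoverable from the transcript], as $n \to \infty$.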
• Observe that we essentially use the van der Waerden
normal scores rank correlation coefficient to measure
dependence between the traits.
• Klaassen and Wellner (1997) showed that this is an
asymptotically efficient estimator of the correlation
parameter in the bivariate normal copula model.
• Hence, it is also an efficient estimator of the maximum
correlation coefficient.
• For a pair of random variables $Y_1$ and $Y_2$, the maximum
correlation coefficient is defined as
$\rho_{\max}(Y_1, Y_2) = \sup_{a, b} \operatorname{corr}(a(Y_1), b(Y_2))$,
where the supremum is taken over all real transformations
a and b such that $a(Y_1)$ and $b(Y_2)$ have finite nonzero
variance.
Simulation study
Application - Lp(a)
• Twin data on lipoprotein levels, collected in 4
populations in three countries (Australia, the
Netherlands, Sweden).
• Analysis was performed using the variance components
method and published by Beekman et al. (2003).
Ad hoc transformation
Lp(a) - chromosome 1
Lp(a) - chromosome 6
Discussion
• The normal copula based method has correct critical levels under
the null hypothesis for any marginal distribution. Its power
seems to be close to optimal.
• The method easily extends to general pedigrees, discrete data,
multiple QTLs, etc.
• It is straightforward to implement in any existing software.
• Other families of copulas (Clayton, Gumbel, etc.) could be more
suitable in certain applications.
Acknowledgments
• C. Klaassen (UvA, Eurandom)
• D. Boomsma (VUA)
• M. Beekman (LUMC)
• N. Martin (Australia)