Transcript statgen7

LINKAGE ANALYSIS
Recombination Fraction
 During synapsis, crossing-over may occur between
any two non-sister chromatids
 If there are allelic differences at the site of crossing
over, the genetic result is recombination
 Genes on the same chromosome are connected
physically (syntenic)
 At least a thousand to several thousand for each
human chromosome
Recombination fraction
 Prophase of the first meiotic cell division
Homologous pairing of chromosomes (synapsis)
Each chromosome consists of two fully formed
chromatids joined at the centromere; four
chromatids for each pair of chromosomes
Each chromatid represents a separate DNA
molecule
Recombination fraction
 If two syntenic genes are close enough that a
crossover occurs between them less than once per
meiosis, on average, the two genes are
genetically linked
Recombination fraction (RF or q) is 1/2 the frequency
of crossovers (a single crossover involves two of four
chromatids in a synapsed pair of chromosomes)
If two loci are so far apart that on average there is at
least one crossover between them in every meiosis,
then q = 50%, the loci are unlinked
q can not be greater than 50%
Morton et al. (1982) used cytological preps of
spermatocytes and reported an average of 52
crossovers per male meiosis
Recombination fraction
 Expressed in map units or centiMorgans (cM)
 1 mu = 1 cM; q of 1% over small distances
 One crossover on average implies a genetic map length of
50 cM
 If two loci are separated by a distance such that an
average of one crossover occurs between them in every
meiotic cell, then those loci are 50 cM apart
 52 crossovers implies a total genetic map length of 2600
cM in humans; thus, 1 cM equals approximately 1
megabase of sequence
 Not additive over long distances due to multiple crossovers
(positive or negative interference); mapping functions have
been developed to address this phenomenon
Recombination fraction
Recombination fraction (q) = number of recombinant gametes / total number of gametes
Linkage
 Linkage describes the phenomenon whereby
alleles at neighbouring loci, being close to one
another on the same chromosome, are
transmitted together more frequently than
expected by chance.
 q = 0 : no recombination => complete linkage
 q < 0.5 : partial linkage
 q = 0.5 : no linkage
Linkage Analysis
 For a couple whose
genotypes at loci A and B are
known, the probability of
observing the genotypes of
the offspring depends on the
value of q
 Let us assume the following
crossing:
 Therefore, such a couple
can have 4 types of offspring
Coupling/Repulsion
 There are two possible situations:


The alleles A1 and B1 may be on the same
chromosome within the pair, in which case A1
and B1 are said to be "coupled";
They may be on different chromosomes, in
which case A1 and B1 are said to be in a state
of "repulsion".
Linkage analysis
 Assuming that there is gamete equilibrium at the A and B loci, in parent 1
there is a probability of 1/2 that alleles A1 and B1 will be coupled, and a
probability of 1/2 that they will be in repulsion.
 (1) A1 and B1 are coupled,



The probability that parent (1) provides each of the gametes A1B1 and A2B2 is (1-q)/2, and the probability that this parent provides each of the gametes A1B2 and A2B1 is
q/2. The probability that the couple will have a child of type (1) or of type (2) is (1-q)/2 each,
and that of their having a type (3) or a type (4) child is q/2 each.
The probability of finding n1 children of type (1), n2 of type (2), n3 of type
(3) and n4 of type (4) is therefore
[(1-q)/2]^(n1+n2) x (q/2)^(n3+n4)
 (2) A1 and B1 are in a state of repulsion



The probability that parent (1) provides each of the gametes A1B2 and A2B1 is (1-q)/2, and the probability that this parent provides each of the gametes A1B1 and A2B2 is
q/2.
The probability of the previous observation is therefore:
(q/2)^(n1+n2) x [(1-q)/2]^(n3+n4)
Linkage analysis
 With no additional information about the phase of A1 and B1, and
assuming that the alleles at the A and B loci are in a state of
gamete equilibrium (coupling and repulsion equally likely), the probability of finding n1, n2, n3 and n4
children in categories (1), (2), (3), (4) is:
p(n1,n2,n3,n4 | q) = 1/2 {[(1-q)/2]^(n1+n2) x (q/2)^(n3+n4) + (q/2)^(n1+n2) x [(1-q)/2]^(n3+n4)}
 So the likelihood of q for an observation n1, n2, n3, n4 can be written:
L(q | n1,n2,n3,n4) = 1/2 {[(1-q)/2]^(n1+n2) x (q/2)^(n3+n4) + (q/2)^(n1+n2) x [(1-q)/2]^(n3+n4)}
Special case: number of children n= 1
 Regardless of the category to which this child belongs
 L(q) = 1/2 [(1-q)/2] + 1/2 [q/2] = 1/4
 The likelihood of this observation for the family does not depend on
q. We can say that such a family is not informative for q.
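As a minimal sketch in Python (the function name and the child counts are illustrative assumptions, not part of the slides), the phase-unknown likelihood can be evaluated at several values of q to check that a single child always gives 1/4, whereas two children already give a likelihood that varies with q:

```python
def phase_unknown_likelihood(q, n1, n2, n3, n4):
    """L(q | n1, n2, n3, n4) when coupling and repulsion are equally likely."""
    n12 = n1 + n2  # children of type (1) or (2)
    n34 = n3 + n4  # children of type (3) or (4)
    coupled = ((1 - q) / 2) ** n12 * (q / 2) ** n34
    repulsion = (q / 2) ** n12 * ((1 - q) / 2) ** n34
    return 0.5 * (coupled + repulsion)

q_values = (0.0, 0.1, 0.3, 0.5)

# One child of type (1): the likelihood does not depend on q.
single_child = [phase_unknown_likelihood(q, 1, 0, 0, 0) for q in q_values]

# Two children of type (1): the likelihood now changes with q.
two_children = [phase_unknown_likelihood(q, 2, 0, 0, 0) for q in q_values]

print(single_child)
print(two_children)
```

The same result holds whichever category the single child belongs to, because the two phase terms always add up to 1/4.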
Informative families
 An "informative family" is a family for which the likelihood is a
non-constant function of q.
 One essential condition for a family to be informative is,
therefore, that it has more than one child. Furthermore, at least
one of the parents must be heterozygotic.
 Definition: if one of the parents is doubly heterozygotic and the
other is
 A double homozygote, we have a backcross
 A single homozygote, we have a simple backcross
 A double heterozygote, we have a double intercross
Definition Of The "Lod Score" Of A
Family
 Take a family of which we know the genotypes at the A and B loci of
each of the members.
 Let L(q) be the likelihood of a recombination fraction q, with 0 ≤ q < 1/2
 L(1/2) is the likelihood of q = 1/2, that is, of independent segregation of
A and B.
 The lod score of the family in q is:
Z(q) = log10 [L(q)/L(1/2)]
 Z can be taken to be a function of q defined over the range [0,1/2].
 The likelihood of a value of q for a sample of independent families is the
product of the likelihoods of each family, and so the lod score of the
whole sample will be the sum of the lod scores of each family.
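The definition can be sketched in Python as follows. The phase-unknown family likelihood is the one given earlier in this transcript; the function names and the three families are hypothetical values chosen for illustration only.

```python
import math

def family_likelihood(q, n12, n34):
    """Phase-unknown likelihood; n12 = n1 + n2 and n34 = n3 + n4."""
    coupled = ((1 - q) / 2) ** n12 * (q / 2) ** n34
    repulsion = (q / 2) ** n12 * ((1 - q) / 2) ** n34
    return 0.5 * (coupled + repulsion)

def family_lod(q, n12, n34):
    """Z(q) = log10[L(q) / L(1/2)] for one family."""
    return math.log10(family_likelihood(q, n12, n34) / family_likelihood(0.5, n12, n34))

def sample_lod(q, families):
    """Lod score of a sample of independent families: sum of the family lod scores."""
    return sum(family_lod(q, n12, n34) for n12, n34 in families)

# Hypothetical sample: one (n1 + n2, n3 + n4) pair per family.
families = [(4, 0), (3, 1), (5, 0)]

# q = 0 is avoided here: a family with both kinds of children has L(0) = 0.
for q in (0.05, 0.1, 0.2, 0.3, 0.5):
    print(f"q = {q:.2f}  Z = {sample_lod(q, families):+.3f}")
```

By construction Z(1/2) = 0 for every family, so the curve always ends at zero.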
Test For Linkage
 Several methods have been proposed to detect linkage: "U
scores", suggested by Bernstein in 1931; "the sib pair test",
by Penrose in 1935; "likelihood ratios", by Haldane and Smith in
1947; and "the lod score method", proposed by Morton in 1955 (1).
Morton’s method is the one most commonly used at present.
The test procedure in the lod score method is sequential (Wald,
1947 (2)). Information, i.e. the number of families in the sample,
is accumulated until it is possible to decide between the
hypotheses H0 and H1 :
H0: genetic independence, q = 1/2
H1: linkage at q = q1, with 0 < q1 < 1/2
The lod score of the sample at q1,
Z(q1) = log10 [L(q1)/L(1/2)],
indicates the relative probabilities of observing the sample under
H1 and under H0. Thus, a lod score of 3 means that the
sample is 1000 times more probable under H1 than
under H0
("lod = logarithm of the odds").
Test For Linkage
 The decision thresholds of the test are usually set at -2 and +3, so that if:
 Z(q1) > 3 H0 is rejected, and linkage is accepted.
 Z(q1) < -2 linkage of q1 is rejected.
 -2 < Z(q1) < 3 it is impossible to decide between
H0 and H1. It is necessary to go on accumulating
information.
For the thresholds chosen, -2 and +3, we can show
that:
The type I error (false positive), a < 10^-3
The type II error (false negative), b < 10^-2
The reliability, 1-r > 0.95 for all q1
r is the probability that we conclude that H1 is true
given H0
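The sequential decision rule can be written as a small Python sketch. The function names and the returned labels are illustrative assumptions; only the thresholds -2 and +3 come from the slides.

```python
def lod_decision(z, lower=-2.0, upper=3.0):
    """Apply the usual lod score thresholds to Z(q1)."""
    if z > upper:
        return "linkage accepted"
    if z < lower:
        return "linkage rejected at this q1"
    return "undecided: accumulate more families"

def odds_in_favour_of_linkage(z):
    """A lod score is a log10 likelihood ratio, so the odds are 10**Z."""
    return 10.0 ** z

for z in (3.4, 0.8, -2.5):
    print(z, lod_decision(z), odds_in_favour_of_linkage(z))
```

A lod score of exactly 3 corresponds to odds of 1000 to 1, which is the figure quoted earlier for the upper threshold.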
Significance of results
 In fact, what is being tested is not a single value of q1
against q = 1/2, but a whole set of values between 0 and
1/2, taken in steps of various sizes (0.01 or 0.05).
 If there is a value of q1 such that Z(q1) > 3,
linkage is concluded to exist.
 If there is a value of q1 such that Z(q1) = -2,
linkage is excluded for any q ≤ q1.
 If, for all q, -2 < Z(q) < 3, no conclusion can
be drawn: the sample is not sufficiently informative.
Criticism
 The proposed test has the advantage of
being very simple, and of providing protection
against falsely concluding linkage.
 However, some criticisms can be leveled, not
only against the criteria chosen, but also
against the entire principle of using a
sequential procedure.
 The number of families typed is, indeed,
rarely chosen in the light of the test results.
Estimation Of The Recombination
Fraction
 If the test, on a sample of families, has
demonstrated linkage between the A and B loci, then
one may want to estimate the recombination fraction
for these loci.
 The estimated value of q is the value which
maximizes the lod score function Z, and this is
equivalent to taking the value of q for which the
probability of observing the sample is
greatest.
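One simple way to carry out this estimation is to evaluate the sample lod score on a grid of q values and keep the maximum. The sketch below assumes phase-unknown families described only by their counts (n1 + n2, n3 + n4); the grid step of 0.01, the function names and the data are illustrative assumptions.

```python
import math

def family_likelihood(q, n12, n34):
    """Phase-unknown likelihood; n12 = n1 + n2 and n34 = n3 + n4."""
    coupled = ((1 - q) / 2) ** n12 * (q / 2) ** n34
    repulsion = (q / 2) ** n12 * ((1 - q) / 2) ** n34
    return 0.5 * (coupled + repulsion)

def sample_lod(q, families):
    """Sum over families of log10[L(q) / L(1/2)]."""
    return sum(
        math.log10(family_likelihood(q, a, b) / family_likelihood(0.5, a, b))
        for a, b in families
    )

def estimate_q(families):
    """Return (q_hat, Z(q_hat)) on the grid 0.01, 0.02, ..., 0.50."""
    grid = [i / 100 for i in range(1, 51)]  # q = 0 excluded to avoid log10(0)
    q_hat = max(grid, key=lambda q: sample_lod(q, families))
    return q_hat, sample_lod(q_hat, families)

# Hypothetical sample of three families.
families = [(7, 1), (6, 0), (5, 1)]
q_hat, z_max = estimate_q(families)
print(f"q_hat = {q_hat:.2f}, Z(q_hat) = {z_max:.3f}")
```

Because q = 1/2 is on the grid and Z(1/2) = 0, the maximum lod score found this way is never negative.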
Recombination Fraction For A Disease
Locus And A Marker Locus
 Let us assume we are dealing with a disease
carried by a single gene, determined by an
allele, g0, located at a locus G (g0: harmful
allele, G0: normal allele).
 We would like to be able to situate locus G
relative to a marker locus T, whose position
on the genome is known. To do
this, we can use families with one or several
affected individuals, in which the genotype
of each member of the family is known with
regard to the marker T.
 In order to be able to use the lod score
method described above, we need to
be able to infer from the phenotype of
the individuals (affected, not affected) their
genotype at locus G (or their genotype
probabilities at locus G).
Disease and Marker Locus
 What we need to know is:
the frequency of allele g0
the penetrance vector f1, f2, f3
f1 = Pr (affected | g0g0)
f2 = Pr (affected | g0G0)
f3 = Pr (affected | G0G0)
 It will often happen that the information available for the marker is
also phenotypic rather than genotypic in nature. Once again, all possible
genotypes must be envisaged.
 As a general rule, the information available about a family concerns the
phenotype. To calculate the likelihood of q, we must envisage all the
possible genotype configurations at each of the loci for this family,
write the likelihood of q for each configuration, and weight it by the
probability of this configuration given the phenotypes of
individuals at A and B.
 Knowledge of the genetic parameters at each of the loci (gene
frequency, penetrance values) is therefore necessary before we can
estimate q.
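To illustrate how a phenotype is turned into genotype probabilities at locus G, the sketch below applies Bayes' rule to a single unrelated individual. It assumes Hardy-Weinberg proportions for the prior genotype frequencies; the function name and the numerical values are hypothetical.

```python
def genotype_posteriors(freq_g0, f1, f2, f3, affected=True):
    """Pr(genotype | phenotype) at locus G for one unrelated individual.

    freq_g0: frequency of the disease allele g0.
    f1, f2, f3: penetrances for g0g0, g0G0 and G0G0.
    Assumes Hardy-Weinberg proportions for the prior genotype frequencies.
    """
    p = freq_g0
    priors = {"g0g0": p * p, "g0G0": 2 * p * (1 - p), "G0G0": (1 - p) ** 2}
    penetrance = {"g0g0": f1, "g0G0": f2, "G0G0": f3}
    joint = {
        g: priors[g] * (penetrance[g] if affected else 1 - penetrance[g])
        for g in priors
    }
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}

# Fully penetrant recessive disease: an affected individual must be g0g0.
recessive = genotype_posteriors(0.01, 1.0, 0.0, 0.0, affected=True)

# Incomplete penetrance with phenocopies: several genotypes remain possible.
incomplete = genotype_posteriors(0.01, 0.8, 0.3, 0.1, affected=True)
print(recessive)
print(incomplete)
```

In a real pedigree the genotypes of relatives are not independent, so linkage programs sum over joint configurations rather than treating each person separately as this sketch does.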
Estimation of L as a function of q and f
 Allele distribution. If the frequency of D is .01,
Hardy-Weinberg equilibrium gives
 Pr(Dd) = 2 x .01 x .99
 Genotypes of founder couples are (usually)
treated as independent.
 Pr(Dd, dd) = (2 x .01 x .99) x (.99)^2
[Pedigree figure: founder 1 is Dd, founder 2 is dd]
Estimation of L as a function of q and f
Pedigree analyses usually suppose that, given the genotype at all loci,
and in some cases age and sex, the chance of having a particular
phenotype depends only on genotype at one locus, and is independent
of all other factors: genotypes at other loci, environment, genotypes and
phenotypes of relatives, etc.
Complete penetrance:
pr(affected | DD) = 1
Incomplete penetrance:
pr(affected | DD) = .8
Estimation of L as a function of q and f
[Pedigree figure: individuals 1 to 5; three are Dd, one is dd and one is DD]
Assume penetrances pr(affected | dd ) = .1, pr(affected | Dd ) = .3
pr(affected | DD ) = .8, and that allele D has frequency .01.
The probability of this pedigree is the product:
(2 x .01 x .99 x .7) x (2 x .01 x .99 x .3) x (1/2 x 1/2 x .9) x (2 x 1/2 x
1/2 x .7) x (1/2 x 1/2 x .8)
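The product can be reproduced factor by factor in Python. The pedigree layout used here (two Dd founders, one unaffected and one affected, with children dd unaffected, Dd unaffected and DD affected) is inferred from the factors of the product, for example .7 = 1 - .3 for an unaffected Dd, and is therefore an assumption about the lost figure.

```python
import math

freq_D = 0.01
penetrance = {"dd": 0.1, "Dd": 0.3, "DD": 0.8}

# Hardy-Weinberg genotype frequencies for founders.
hardy_weinberg = {
    "DD": freq_D ** 2,
    "Dd": 2 * freq_D * (1 - freq_D),
    "dd": (1 - freq_D) ** 2,
}

# Mendelian transmission probabilities for the children of a Dd x Dd mating.
transmission = {"DD": 0.5 * 0.5, "Dd": 2 * 0.5 * 0.5, "dd": 0.5 * 0.5}

def phenotype_probability(genotype, affected):
    """Pr(phenotype | genotype) from the penetrance values."""
    return penetrance[genotype] if affected else 1 - penetrance[genotype]

founders = [("Dd", False), ("Dd", True)]                  # (genotype, affected)
children = [("dd", False), ("Dd", False), ("DD", True)]

probability = 1.0
for genotype, affected in founders:
    probability *= hardy_weinberg[genotype] * phenotype_probability(genotype, affected)
for genotype, affected in children:
    probability *= transmission[genotype] * phenotype_probability(genotype, affected)

print(probability)
```

Each founder contributes a population frequency times a penetrance term, and each child contributes a transmission probability times a penetrance term.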
Estimation of L as a function of q and f
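[Pedigree figure: individuals 1, 2 and 3, with two-locus genotypes DD TT, Dd Tt and dd tt]
Gamete probabilities for the double heterozygote Dd Tt (phase DT / dt):
        T          t          total
D       (1-q)/2    q/2        1/2
d       q/2        (1-q)/2    1/2
total   1/2        1/2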
Estimation of L as a function of q and f
Two-locus founder probabilities are typically calculated assuming
linkage equilibrium, i.e. independence of genotypes across loci.
If D and d have frequencies .01 and .99 at one locus, and T and t have
frequencies .25 and .75 at a second, linked locus, this assumption means
that DT, Dt, dT and dt have frequencies .01 x .25, .01 x .75, .99 x .25 and
.99 x .75 respectively. Together with Hardy-Weinberg, this implies that
pr(DdTt) = (2 x .01 x .99) x (2 x .25 x .75)
= 2 x (.01 x .25) x (.99 x .75)
+ 2 x (.01 x .75) x (.99 x .25).
This last expression adds haplotype pair probabilities.
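The equality of the two expressions can be checked numerically. This short Python sketch uses the allele frequencies quoted above; the variable names are illustrative.

```python
import math

freq = {"D": 0.01, "d": 0.99, "T": 0.25, "t": 0.75}

# Haplotype frequencies under linkage equilibrium: products of allele frequencies.
haplotype = {
    "DT": freq["D"] * freq["T"],
    "Dt": freq["D"] * freq["t"],
    "dT": freq["d"] * freq["T"],
    "dt": freq["d"] * freq["t"],
}

# Product of the two single-locus Hardy-Weinberg heterozygote frequencies.
genotype_product = (2 * freq["D"] * freq["d"]) * (2 * freq["T"] * freq["t"])

# Sum over the two possible phases, DT/dt and Dt/dT.
haplotype_pair_sum = (
    2 * haplotype["DT"] * haplotype["dt"] + 2 * haplotype["Dt"] * haplotype["dT"]
)

print(genotype_product, haplotype_pair_sum)
```

Under linkage equilibrium the two phases are equally probable, which is why each contributes half of pr(DdTt).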
Estimation of L as a function of q and f
[Pedigree figure: father Dd Tt, mother dd tt, child Dd Tt]
Initially, this must be done with haplotypes, so that account can be
taken of recombination. Then terms like that below are summed over
possible phases. Here only the father can exhibit recombination:
mother is uninformative.
pr(kid DT/dt | pop DT/dt & mom dt/dt )
= pr(kid DT | pop DT/dt ) x pr(kid dt | mom dt/dt )
= (1-q)/2 x 1.
Two Loci: Penetrance

In all standard linkage programs, different parts of the
phenotype are conditionally independent given all
genotypes, and two-locus penetrances split into products
of one-locus penetrances. Assume the penetrances
for DD, Dd and dd given earlier, and that T, t are two
alleles at a co-dominant marker locus. Then:
Pr(affected & Tt | DD, Tt)
= Pr(affected | DD, Tt) x Pr(Tt | DD, Tt)
= 0.8 x 1
Estimation of L as a function of q and f
 We assume below pop is as likely to be DT / dt as Dt / dT.
[Pedigree figure: father Dd Tt, mother dd tt; four children: Dd Tt, Dd Tt, dd tt and Dd tt]
Pr(all data | q)
= pr(parents' data | q) x pr(kids' data | parents' data, q)
= pr(parents' data) x {[((1-q)/2)^3 x q/2]/2 + [(q/2)^3 x (1-q)/2]/2}
Lod Scores and LA software
 Calculating lod scores, despite being simple in
theory, is in practice a lengthy and tedious business. In 1955, Morton
provided a set of tables giving the lod scores for various values of q for
a disease locus and a marker locus for nuclear families with sibship
sizes of 2 to 7. However, the situations envisaged were very restrictive.
In particular, it was assumed that the disease was determined by a
rare, fully penetrant dominant or recessive gene (f1 = f2 = 1, f3 = 0
or f1 = 1, f2 = f3 = 0).
"LIPED", written by Ott in 1974, was the pioneering software in linkage
analysis. It is able to carry out this calculation in an extensive pedigree
for any values of q, f1, f2, f3, and for penetrance as a function of age.
The "Linkage" program of Lathrop et al. (1984) is the one most often
used for gene mapping. It can be used to carry out multipoint analysis.
All the software described so far is based on the same recursive
algorithm (Elston and Stewart), which means that it can be used to
investigate pedigrees of any size, but that it envisages all the possible
haplotype combinations of markers, and is therefore limited by the
number of markers to be taken into account.
LA software
 In contrast, "Genehunter", which is based on a
Markov chain principle, is limited not by the number
of markers taken into consideration in the analysis,
but by the size of the family structure.
 The very recently developed software package
"Allegro" can apply information from a large number
of markers and extended family structures.
 Analysis of gene linkage has made it possible to
construct a gene map by locating new
polymorphisms relative to one another on the genome.
The measurement used on the gene map is not the
recombination fraction, which is not an additive
quantity, but the genetic distance, which we will define
below.
Linkage Analysis For Three Loci :
Interference
 Now let us consider three loci A, B and C. Let the
recombination fraction between A and B be q1, that
between B and C be q2 and that between A and C be
q3.
 Let us consider the double recombinant event, firstly
between A and B, and secondly between B and C.
Let R12 be the probability of this event. If the
crossings-over occur independently in segments AB
and BC, then:
 R12 = q1 q2
Interference
 If this is not the case, an
interference phenomenon is
occurring and
R12 = C q1 q2, where C ≠ 1
If C < 1 the interference is said
to be positive; crossings-over in segment AB inhibit those
in segment BC.
If C > 1 the interference is said to
be negative; crossings-over
in segment AB promote those in
segment BC.
Let us consider the case of a
triple heterozygous individual.
Such an individual can produce 8
types of gametes.
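Gamete types of a triple heterozygote (phase A1 B1 C1 / A2 B2 C2), with the probability of each complementary pair:
A1 B1 C1 or A2 B2 C2 (no recombination): 1 - q1 - q2 + R12
A1 B2 C2 or A2 B1 C1 (recombination between A and B only): q1 - R12
A1 B1 C2 or A2 B2 C1 (recombination between B and C only): q2 - R12
A1 B2 C1 or A2 B1 C2 (recombination in both segments): R12
The two single-recombinant classes together have probability q3.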
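The four gamete classes can be tabulated with a few lines of Python. This is a sketch under the definitions above (R12 = C q1 q2); the function name, the class labels and the numerical values are illustrative assumptions.

```python
import math

def gamete_class_probabilities(q1, q2, c=1.0):
    """Probabilities of the four complementary gamete classes of a triple heterozygote.

    q1: recombination fraction between A and B.
    q2: recombination fraction between B and C.
    c:  coincidence coefficient C, so that R12 = C * q1 * q2.
    """
    r12 = c * q1 * q2
    return {
        "no recombination": 1 - q1 - q2 + r12,
        "A-B only": q1 - r12,
        "B-C only": q2 - r12,
        "both segments": r12,
    }

classes = gamete_class_probabilities(0.1, 0.2, c=0.5)

# A and C recombine when exactly one of the two segments recombines.
q3 = classes["A-B only"] + classes["B-C only"]

print(classes)
print("q3 =", q3)
```

With C = 0.5 (positive interference) the double-recombinant class is half as frequent as it would be under independence.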
Interference
 We can write that:
q3 = q1 + q2 - 2 R12
q3 = q1 + q2 - 2 C q1 q2
 If C = 1
q3 = q1 + q2 - 2 q1 q2
The recombination fraction is a non-additive measurement.
However, we can write
(1 - 2 q3) = (1 - 2 q1)(1 - 2 q2)
If x(q) = k Log (1 - 2q), where Log is the natural logarithm, then we have x(q3) = x(q1) + x(q2),
and for k = -1/2, x(q) ~ q for small values of q. x(q) = -1/2 Log (1 - 2q)
is an additive measurement.
It is known as the genetic distance, and is measured in
Morgans. It can be shown that x measures the mean number of
crossings-over.
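The additivity of this distance, which is Haldane's mapping function, can be checked numerically. In the sketch below the function names and the values of q1 and q2 are illustrative; the absence of interference (C = 1) is assumed throughout.

```python
import math

def map_distance(q):
    """Genetic distance in Morgans: x(q) = -1/2 ln(1 - 2q), valid for 0 <= q < 1/2."""
    return -0.5 * math.log(1 - 2 * q)

def recombination_fraction(x):
    """Inverse function: q(x) = (1 - exp(-2x)) / 2."""
    return 0.5 * (1 - math.exp(-2 * x))

q1, q2 = 0.1, 0.2
q3 = q1 + q2 - 2 * q1 * q2  # no interference (C = 1)

x1, x2, x3 = map_distance(q1), map_distance(q2), map_distance(q3)
print(f"q1 + q2 = {q1 + q2:.3f} but q3 = {q3:.3f}")
print(f"x1 + x2 = {x1 + x2:.4f} and x3 = {x3:.4f}")

# For small q the distance is close to the recombination fraction itself.
print(map_distance(0.01))
```

Recombination fractions do not add (0.30 versus 0.26 here), whereas the corresponding distances do.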
Test for the presence of interference
 Let us consider a sample of families
genotyped at loci A, B and C. Let Lc be the maximum
likelihood over q1, q2, q3 and L1 the maximum
likelihood when we impose the constraint C = 1
(i.e. q3 = q1 + q2 - 2 q1 q2)
 Then -2 Log (L1/Lc) follows a chi-square (χ²) distribution with
one degree of freedom.
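Once the two maximized log-likelihoods are available, the test itself is a one-line computation. The sketch below uses only the Python standard library: for one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)). The function name and the log-likelihood values are hypothetical.

```python
import math

def interference_test(log_lc, log_l1):
    """Likelihood-ratio test of C = 1.

    log_lc: natural log of the likelihood maximized over q1, q2, q3.
    log_l1: natural log of the likelihood maximized under the constraint C = 1.
    Returns the statistic -2 ln(L1 / Lc) and its p-value on 1 degree of freedom.
    """
    statistic = max(-2.0 * (log_l1 - log_lc), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))  # chi-square tail, 1 df
    return statistic, p_value

# Hypothetical maximized log-likelihoods.
statistic, p_value = interference_test(log_lc=-120.3, log_l1=-123.1)
print(f"statistic = {statistic:.2f}, p = {p_value:.4f}")
```

A statistic above 3.84 corresponds to rejection of C = 1 at the 5% level.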
Reminder: chi-square (χ²) distribution
The chi-square distribution is a skewed distribution
[Figure: chi-square density plotted against the test statistic (axis from 0 to 20); H0 is accepted below the critical value and rejected in the critical region above it]
χ² = Σ (O - E)^2 / E
where ‘O’ is the observed value and ‘E’ is the expected value
The critical value depends on the degrees of freedom
Genetic Heterogeneity Of Localization
 The analysis of genetic linkage can be complicated
by the fact that mutations of several genes, located at
different places on the genome, can give rise to the
same disorder. This is known as genetic
heterogeneity of localization.
 One of the following two tests is used to identify
heterogeneity of this type:


The "Predivided sample test"
The "Admixture Test".
 The first test is usually only appropriate if there is a
good family stratification criterion or if each family
individually has high informativity.
II- 1. The Predivided Sample Test



This test is intended to demonstrate linkage heterogeneity in different subgroups of a sample of families. The aim is to test whether the genetic linkage
between a disease and its marker(s) is the same in all sub-groups. These
groups are formed ad hoc on the basis of clinical or geographical criteria
etc....
Let us assume that the total sample of families has been divided into n subgroups (it is possible to test for the existence of as many sub-groups as
families). qi denotes the true value of the recombination fraction of sub-group
i.
We want to test the null hypothesis H0: q1= q2 = q3 = …= qn against the
alternative hypothesis H1: the values of qi are not all equal.

Under H0, the quantity
Q = 2 ln(10) x Σ_i [Zi(q̂i) - Zi(q̂)]
follows a χ² distribution with (n-1) degrees of freedom, where q̂i is the estimate of q in sub-group i and q̂ is the estimate in the whole sample. The homogeneity of
the sample for linkage is rejected with a type I error equal to a if Q is above the critical threshold χ²(n-1) corresponding to a.
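As a sketch of how the statistic Q = 2 ln(10) [sum of the sub-group maximum lod scores minus the maximum of the pooled lod score] could be computed, the Python code below takes one lod score curve per sub-group, tabulated on a common grid of q values. The data structure, the function name and the lod values are hypothetical.

```python
import math

def predivided_sample_statistic(subgroup_lods):
    """Q = 2 ln(10) * [sum_i max_q Z_i(q) - max_q sum_i Z_i(q)].

    subgroup_lods: list of dicts mapping q to Z_i(q), all on the same grid.
    """
    grid = list(subgroup_lods[0].keys())
    sum_of_maxima = sum(max(curve.values()) for curve in subgroup_lods)
    pooled_maximum = max(sum(curve[q] for curve in subgroup_lods) for q in grid)
    return 2 * math.log(10) * (sum_of_maxima - pooled_maximum)

# Hypothetical lod score curves for two sub-groups.
group_1 = {0.05: 3.0, 0.20: 1.5, 0.40: 0.2}
group_2 = {0.05: -1.0, 0.20: 0.8, 0.40: 0.5}

q_statistic = predivided_sample_statistic([group_1, group_2])
print(f"Q = {q_statistic:.3f} on {2 - 1} degree of freedom")
```

The factor ln(10) converts lod scores, which use base-10 logarithms, into natural log-likelihoods. With two sub-groups Q is compared with the χ² critical value for one degree of freedom (3.84 at the 5% level).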

The Admixture Test

The "admixture test" is not based on an ad hoc subdivision of the families. It is
assumed that, among all the families studied, genetic linkage between the
disease and the marker is found only in a proportion a of the families, with a
recombination fraction q < 1/2. In the remaining (1-a) families, it is assumed that
there is no linkage with the marker (q = 1/2).

For each family i of the sample, the likelihood is calculated
Li(a, q) = a Li(q) + (1-a) Li(1/2),

where Li(q) is the likelihood of q for family i. The likelihood of the couple (a, q)
is defined by the product of the likelihoods associated with all the families:
L(a, q) = Π_i Li(a, q)

We test whether a is significantly different from 1 by comparing Lmax(a
= 1, q), the maximized likelihood for q assuming homogeneity, and Lmax(a, q),
the maximized likelihood for the two parameters a and q (nested models).
Then the variable Q = 2 [Ln Lmax(a, q) - Ln Lmax(a = 1, q)]
follows a χ² distribution with one degree of freedom.
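The admixture likelihood can be maximized by a simple grid search, as in the Python sketch below. The family likelihood used here is the phase-unknown two-locus likelihood from earlier in this transcript; the grids, the function names and the family counts are illustrative assumptions, and real analyses would use pedigree likelihoods from a linkage program.

```python
import math

def family_likelihood(q, n12, n34):
    """Phase-unknown likelihood; n12 = n1 + n2 and n34 = n3 + n4."""
    coupled = ((1 - q) / 2) ** n12 * (q / 2) ** n34
    repulsion = (q / 2) ** n12 * ((1 - q) / 2) ** n34
    return 0.5 * (coupled + repulsion)

def admixture_log_likelihood(a, q, families):
    """ln L(a, q) = sum_i ln[a Li(q) + (1 - a) Li(1/2)]."""
    return sum(
        math.log(a * family_likelihood(q, n12, n34)
                 + (1 - a) * family_likelihood(0.5, n12, n34))
        for n12, n34 in families
    )

def admixture_test(families):
    """Return (Q, a_hat, q_hat) from a grid search over a and q."""
    q_grid = [i / 100 for i in range(1, 50)]   # 0.01 ... 0.49
    a_grid = [j / 20 for j in range(0, 21)]    # 0.00 ... 1.00
    full = max((admixture_log_likelihood(a, q, families), a, q)
               for a in a_grid for q in q_grid)
    homogeneous = max(admixture_log_likelihood(1.0, q, families) for q in q_grid)
    return 2 * (full[0] - homogeneous), full[1], full[2]

# Hypothetical sample mixing apparently linked and apparently unlinked families.
families = [(6, 0), (5, 1), (6, 0), (3, 3), (2, 4), (3, 3)]
q_statistic, a_hat, q_hat = admixture_test(families)
print(f"Q = {q_statistic:.3f}, a_hat = {a_hat:.2f}, q_hat = {q_hat:.2f}")
```

Because a = 1 belongs to the grid, the homogeneous model is nested in the search and Q cannot be negative.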
Generalization Of The Admixture Test




In some single-gene diseases, several genes have been shown to exist at
different locations. This is true, for example, of multiple exostosis disease, for
which 3 genes have been identified successively on 3 different chromosomes.
The "admixture test" is then extended to determine the proportion of families in
which each of the three genes is implicated, and the possibility that there is a
fourth gene.
The three locations on chromosomes 8, 19 and 11 were denoted E1, E2 and
E3, and the proportions of families concerned a1, a2 and a3 respectively. a4
was used to represent the proportion of the families in which another location
was involved.
For each family i of the sample, the likelihood was calculated using the observed
segregation within the family of the markers available in each of the three
regions, according to the clinical status of each of its members.
Li(E1, E2, E3, a1, a2, a3 | Fi) = a1 [L(E1|Fi)/L(E1=1/2|Fi)] + a2 [L(E2|Fi)/L(E2=1/2|Fi)]
+ a3 [L(E3|Fi)/L(E3=1/2|Fi)] + a4

For all the families
L(E1, E2, E3, a1, a2, a3 | F) = Π_i Li(E1, E2, E3, a1, a2, a3 | Fi)
Each ai can be tested to see if it is equal to 0, and then the corresponding
non-null ai and Ei values are estimated.
Generalization of The Admixture Test
-Results
 It is also possible to calculate the probability that the gene
implicated is at E1, E2 or E3 for each of the families in the
sample. The post hoc probability makes use of the estimated ai
proportions, but also of the specific observations in this family.
 The sample investigated has been shown to consist of three
types of families: in 48% of families, the gene is located on
chromosome 8, in 24% of them on chromosome 19, and in 28%
of families the gene is located on chromosome 11. There was no
evidence of a fourth location in this sample.
 The post hoc probabilities of belonging to one of these 3 subgroups were then estimated: the probability that the gene
implicated would be on chromosome 8 was over 90% for 5
families, that it would be on chromosome 19 for 3 of them, and
that it would be on chromosome 11 for 4 families. For the other
families, the situation was less clear-cut: the post hoc
probabilities are similar to the a priori probabilities because of
the paucity of information provided by the markers used.
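A post hoc probability of this kind can be sketched with Bayes' rule, weighting each term of the family's mixture likelihood by the estimated proportion. The slides do not give the formula, so the form used below is an assumption; the function name and the likelihood ratios are hypothetical, and only the proportions 0.48, 0.24 and 0.28 come from the results quoted above.

```python
import math

def location_posteriors(proportions, likelihood_ratios, other_proportion=0.0):
    """Posterior probability of each location for one family.

    proportions: estimated a1, a2, a3.
    likelihood_ratios: L(Ej | Fi) / L(Ej = 1/2 | Fi) for this family.
    other_proportion: a4, the share of families linked to none of the locations.
    """
    terms = [a * lr for a, lr in zip(proportions, likelihood_ratios)]
    total = sum(terms) + other_proportion
    return [term / total for term in terms]

estimated = (0.48, 0.24, 0.28)  # chromosomes 8, 19 and 11

# Uninformative family: every likelihood ratio equals 1.
uninformative = location_posteriors(estimated, (1.0, 1.0, 1.0))

# Hypothetical family with strong evidence for the chromosome 8 location.
informative = location_posteriors(estimated, (200.0, 0.5, 0.8))

print(uninformative)
print(informative)
```

When the markers carry no information the posterior probabilities simply reproduce the estimated proportions, which is the situation described for the less clear-cut families.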
Gamete Disequilibrium Between
Alleles At The Disease Locus And At
The Marker Locus
 An association between a susceptibility gene and a marker can
lead to bias in the estimation of the recombination fraction. In
particular, the "lod scores" method specifies that there must be
no selection for the marker in the sample.
 However, in a context of an association, selection based on the
status of the patient implicitly involves selection for a marker.
 Furthermore, the calculation assumes that the probability for
each genetic combination is equal in the parents, and this is not
true if there is an association.
 In the analysis, failing to take into account the disequilibrium
existing between disease alleles and marker alleles induces a
very great under-estimation of the "lod score" (in other words, a
marked reduction in the power of the linkage test) and a very
slight under-estimation of the recombination fraction.
The Problem Of Multiple Tests
 One of the difficulties encountered in the statistical interpretation of
analyses of the genetic linkage of complex diseases arises from
the fact that, in general and with a varying degree of explicitness, the
data are subjected to multiple tests: several clinical classifications,
several genetic markers, several models, several samples.
It is quite clear that the stopping criteria usually used in the lod
score test no longer have the same statistical significance when several
tests are applied simultaneously to the same sample or to several
samples.
The situation is much more complex for multifactorial diseases,
because the multiplicity of the tests has several types of impact and
these are not independent.
Multiple tests could be taken into account by readjusting the
stopping criterion of the lod score test. However, on the one
hand, it is not always clear which tests have actually been carried out,
and on the other, this can make the test too conservative.
Replication strategy: if a positive result is replicated in a new sample
(using the same classification, the same marker, the same transmission
model), this provides a reliable threshold of significance.