2002-09-10: Segregation Analysis I
Lecture 5: Segregation Analysis I
Date: 9/10/02
Counting number of genotypes, mating types
Segregation analysis: dominant, codominant,
estimating segregation ratio
Testing populations: polymorphism,
heterogeneity, heterozygosity, allele
frequency.
Probability: The Need for
Permutations and Combinations
Often, particularly in genetics, the sample
space consists of all orders or arrangements
of groups of objects (usually genes or alleles
in genetics).
Permutations, combinations, and
combinations with repetition exist to handle
this elegantly.
Probability: Permutation
Definition: A permutation is the number of ways
one can order r elements out of n elements. It is
often written nPr and is calculated as
${}_nP_r = \frac{n!}{(n-r)!}$
Example: How many different types of
heterozygotes exist when there are l alleles and we
distinguish order (e.g. paternal vs. maternal)?
Probability: Combination
Definition: A combination is the number of
ways you can select r objects from n objects
without regard to order. It is written as nCr
and has value
${}_nC_r = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$
Example: How many different
heterozygotes exist without regard to order
when there are l types of alleles?
Probability: Combination with
Repetition
Definition: Suppose there are n different types of
elements and r are selected with replacement, then
the number of combinations is given by
$C'(n, r) = {}_{n+r-1}C_r$.
Examples:
How many genotypes are possible when there are l
alleles?
How many mating types are possible when there are
l alleles?
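The counting examples above can be checked numerically. The following is a minimal sketch, assuming a genotype is an unordered pair of alleles (repeats allowed) and a mating type is an unordered pair of genotypes (repeats allowed); the function names are illustrative and not from the lecture.

```python
from math import comb, perm

def ordered_heterozygotes(l: int) -> int:
    """Heterozygotes when paternal vs. maternal origin matters: lP2."""
    return perm(l, 2)

def unordered_heterozygotes(l: int) -> int:
    """Heterozygotes without regard to order: lC2."""
    return comb(l, 2)

def comb_with_repetition(n: int, r: int) -> int:
    """C'(n, r) = (n + r - 1) choose r."""
    return comb(n + r - 1, r)

def genotypes(l: int) -> int:
    """Assumption: a genotype is 2 alleles drawn with replacement from l."""
    return comb_with_repetition(l, 2)

def mating_types(l: int) -> int:
    """Assumption: a mating type is 2 genotypes drawn with replacement."""
    return comb_with_repetition(genotypes(l), 2)

for l in (2, 3, 4):
    print(l, ordered_heterozygotes(l), unordered_heterozygotes(l),
          genotypes(l), mating_types(l))
```

For l = 2 this gives 3 genotypes and 6 mating types, which agrees with the six mating types listed in the segregation tables.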
Review: Segregation Ratio
Recall that the law of segregation states that
one of the two alleles of a parent is randomly
selected to pass on to the offspring.
Definition: The segregation ratios are the
predictable proportions of genotypes and
phenotypes in the offspring of particular
parental crosses. e.g. 1 AA : 2 AB : 1 BB
following a cross of AB X AB.
Segregation Ratio Distortion
Definition: Segregation ratio distortion is a
departure from expected segregation ratios.
The purpose of segregation analysis is to
detect significant segregation ratio distortion.
A significant departure would suggest that one of
our assumptions about the model is wrong.
Segregation Analysis: What it
Teaches Us
Genetic model for a single locus gene: dominant,
codominant, truly single locus
Other genetic information: selection-free,
completely penetrant.
Data quality: systematic error, non-random
sampling.
Few important genes are single-locus. Often single
locus analysis is used to verify marker systems.
Segregation Analysis:
Experimental Design
Run a controlled cross with known expected
segregation ratios. OR
Sample offspring of particular mating type
with known expected segregation ratios.
Verify segregation ratios.
Autosomal Dominant
Mating type   Genotype                Phenotype
              DD     Dd     dd        Dominant  Recessive
   DDxDD      1      0      0         1         0
A  DDxDd      0.5    0.5    0         1         0
   DDxdd      0      1      0         1         0
B  DdxDd      0.25   0.5    0.25      0.75      0.25
C  Ddxdd      0      0.5    0.5       0.5       0.5
   ddxdd      0      0      1         0         1
Autosomal Dominant: The
Data and Hypothesis
Obtain a random sample of matings between
affected (Dd) and unaffected (dd)
individuals.
Sample n of their offspring and find that r are
affected with the disease (i.e. Dd).
H0: proportion of affected offspring is 0.5
Autosomal Dominant:
Binomial Test
H0: p = 0.5
If r ≤ n/2, p-value = $2P(X \le r)$
If r > n/2, p-value = $2P(X \le n - r)$
where $P(X \le c) = \sum_{x=0}^{c} \binom{n}{x} \left(\frac{1}{2}\right)^n$
Observe r = 29 of n = 50: p-value = 0.32
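The exact test above can be reproduced in a few lines. This is a minimal sketch, assuming n = 50 (the sample size used with r = 29 in the confidence interval example later in the lecture); the cap at 1 is an addition for the case r = n/2, where doubling the tail would exceed 1.

```python
from math import comb

def binom_cdf_half(c: int, n: int) -> float:
    """P(X <= c) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, x) for x in range(c + 1)) / 2 ** n

def two_sided_binomial_p(r: int, n: int) -> float:
    """Two-sided p-value for H0: p = 0.5, doubling the smaller tail."""
    c = r if r <= n / 2 else n - r
    # Cap at 1 (assumption added here; doubling can overshoot when r = n/2).
    return min(1.0, 2 * binom_cdf_half(c, n))

print(round(two_sided_binomial_p(29, 50), 2))
```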
Autosomal Dominant: Standard
Normal Test
$\mu = np$
$\sigma^2 = np(1-p)$
$X \sim N(np,\ np(1-p))$ approximately, so $Z = \frac{X - np}{[np(1-p)]^{1/2}}$
Under H0, $X \sim N(n/2,\ n/4)$
$z = \frac{r - n/2}{(n/4)^{1/2}} = 1.13$
Observe r = 29: p-value = 0.26
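The normal approximation can be sketched as follows, again assuming n = 50 for the observed r = 29. `NormalDist` is from the Python standard library; the function name `normal_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def normal_test(r: int, n: int, p0: float = 0.5):
    """Return (z, two-sided p-value) for H0: p = p0 using the normal approximation."""
    z = (r - n * p0) / sqrt(n * p0 * (1 - p0))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = normal_test(29, 50)
print(round(z, 2), round(p_value, 2))
```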
Autosomal Dominant: Pearson
Chi-Square Test
The distribution of the sum of k squares of iid
standard normal variables is defined as a chi-square
distribution with k degrees of freedom.
$Z^2 = \frac{(X - np)^2}{np(1-p)} \sim \chi^2_1$
$Z^2 = \frac{(X - np)^2}{np} + \frac{[(n - X) - n(1-p)]^2}{n(1-p)}$
$z^2 = \frac{(r - n/2)^2}{n/4} = 1.28$
p-value = 0.26
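A minimal sketch of the Pearson statistic in its observed/expected form, assuming n = 50 so that the observed counts are 29 and 21 against 25 and 25 expected. For 1 degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)), which avoids needing a statistics library.

```python
from math import erfc, sqrt

def pearson_chi2(observed, expected) -> float:
    """Sum of (o - e)^2 / e over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_1df(x: float) -> float:
    """P(chi-square with 1 df > x), valid for 1 degree of freedom only."""
    return erfc(sqrt(x / 2))

stat = pearson_chi2([29, 21], [25, 25])
print(round(stat, 2), round(chi2_sf_1df(stat), 2))
```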
Continuity Correction
Both the normal and chi-square are
continuous distributions, but our count data are discrete.
Continuity correction for Normal: r = 28.5
corrected p-value = 0.32
Continuity correction for Chi-Square:
r = 28.5; n-r = 21.5
corrected p-value = 0.32
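The corrections above amount to shrinking the deviation |r - n/2| by 0.5 before computing either statistic. A minimal sketch, assuming n = 50 and the usual half-unit correction:

```python
from math import erfc, sqrt

def corrected_tests(r: int, n: int, p0: float = 0.5):
    """Continuity-corrected normal and chi-square tests of H0: p = p0."""
    expected = n * p0
    # Move the observed count half a unit toward its expectation.
    dev = max(abs(r - expected) - 0.5, 0.0)
    z = dev / sqrt(n * p0 * (1 - p0))
    chi2 = dev ** 2 / expected + dev ** 2 / (n - expected)
    p_normal = erfc(z / sqrt(2))    # two-sided normal tail
    p_chi2 = erfc(sqrt(chi2 / 2))   # chi-square tail, 1 df
    return z, p_normal, chi2, p_chi2

print([round(v, 2) for v in corrected_tests(29, 50)])
```

Both corrected p-values come out close to the exact binomial value, which is the point of the correction.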
Autosomal Dominant:
Likelihood Ratio Test
Write likelihood: $L(p) = \binom{n}{r} p^r (1-p)^{n-r}$
Calculate the MLE under HA: $\hat{p} = r/n$
Calculate the G statistic:
$G = 2[\log L_A - \log L_0] = 2\sum_{i=1}^{c} o_i \log\frac{o_i}{e_i}$
$= 2\left[r \log\frac{r}{0.5n} + (n-r) \log\frac{n-r}{0.5n}\right]$
Determine G distribution: $G \sim \chi^2_1$
Calculate p-value = 0.26
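The G statistic can be computed directly from observed and expected counts. A minimal sketch, assuming n = 50 (so observed counts 29 and 21, expected 25 and 25) and natural logarithms:

```python
from math import erfc, log, sqrt

def g_statistic(observed, expected) -> float:
    """G = 2 * sum of o * log(o / e); classes with o = 0 contribute 0."""
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

g = g_statistic([29, 21], [25, 25])
p_value = erfc(sqrt(g / 2))  # chi-square tail probability, 1 df only
print(round(g, 2), round(p_value, 2))
```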
Estimating Segregation Ratio:
MOM
first moment = np
sample moment = r
MOM: np = r
MOM estimate: $\hat{p} = r/n$
Estimating Segregation Ratio:
Likelihood Method
Set score to 0: $\frac{r}{\hat{p}} - \frac{n-r}{1-\hat{p}} = 0$
Solve for MLE: $\hat{p} = \frac{r}{n}$
Estimating Confidence Interval
for Segregation Ratio
Our estimate is X/n, where X is the random variable
representing the number of “successes” observed
and n is the sample size.
E(X/n) = E(X)/n = np/n = p
Var(X/n) = Var(X)/n^2 = np(1-p)/n^2 = p(1-p)/n
SE(X/n) = $[\hat{p}(1-\hat{p})/n]^{1/2}$
Therefore, X/n is unbiased and we can obtain a
confidence interval using a normal approximation
with SE(X/n).
Estimating Confidence Interval
for Segregation Ratio
$\hat{p} = \frac{29}{50} = 0.58$
$SE(\hat{p}) = \left[\frac{(29/50)(21/50)}{50}\right]^{1/2} = 0.0698$
$(\hat{p} - 1.96\,SE,\ \hat{p} + 1.96\,SE) = (0.443,\ 0.717)$
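The interval above can be reproduced with a short function. This is a sketch of the normal-approximation (Wald) interval described on the slide; the name `wald_ci` is illustrative.

```python
from math import sqrt

def wald_ci(r: int, n: int, z: float = 1.96):
    """Return (p_hat, SE, (lower, upper)) using the normal approximation."""
    p_hat = r / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

p_hat, se, (lower, upper) = wald_ci(29, 50)
print(round(p_hat, 2), round(se, 4), round(lower, 3), round(upper, 3))
```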
Segregation Analysis:
Codominant Loci I
Mating type   Genotype
              DD     Dd     dd
DDxDD         1      0      0
DDxDd         0.5    0.5    0
DDxdd         0      1      0
DdxDd         0.25   0.5    0.25
Ddxdd         0      0.5    0.5
ddxdd         0      0      1
Segregation Analysis:
Codominant Loci II
All 6 mating types are identifiable.
Each mating type can be tested for agreement with
expected segregation ratios.
Some mating types result in 3 types of offspring.
Must use Chi-Square or likelihood ratio test.
Multiple Populations: Testing
for Heterogeneity
Suppose you observe segregation ratios in samples
of size n in m populations.
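Calculate a total chi-square:
$\chi^2_{total} = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
Calculate a pooled chi-square:
$\chi^2_{pooled} = \sum_{j=1}^{n} \frac{\left(\sum_{i=1}^{m} o_{ij} - \sum_{i=1}^{m} e_{ij}\right)^2}{\sum_{i=1}^{m} e_{ij}}$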
Multiple Populations: Testing
for Heterogeneity
Then, $\chi^2_{total} - \chi^2_{pooled} \sim \chi^2_{n(m-1)}$
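The total and pooled chi-squares can be sketched as below. The counts are hypothetical, made up for illustration (they are not Mendel's data), and the expected ratio is assumed to be 3:1. The degrees of freedom are returned as (m - 1)(c - 1) for m populations and c offspring classes, the standard value for a heterogeneity chi-square; treating the slide's n(m - 1) this way is an assumption.

```python
def pearson(observed, expected) -> float:
    """Sum of (o - e)^2 / e over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def heterogeneity_chi2(counts, ratios):
    """counts: one row of class counts per population; ratios: expected proportions."""
    m, c = len(counts), len(ratios)
    expected = [[sum(row) * q for q in ratios] for row in counts]
    total = sum(pearson(o, e) for o, e in zip(counts, expected))
    pooled_o = [sum(col) for col in zip(*counts)]    # sum over populations
    pooled_e = [sum(col) for col in zip(*expected)]
    pooled = pearson(pooled_o, pooled_e)
    return total, pooled, total - pooled, (m - 1) * (c - 1)

# Hypothetical counts (dominant, recessive) for three populations.
counts = [(30, 10), (28, 12), (35, 5)]
total, pooled, het, df = heterogeneity_chi2(counts, (0.75, 0.25))
print(round(total, 3), round(pooled, 3), round(het, 3), df)
```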
Multiple Populations: Testing
for Heterogeneity
Alternatively, one may calculate G statistics.
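Then, $G_{total} - G_{pooled}$ is also distributed as $\chi^2_{n(m-1)}$.
$G_{total} = 2 \sum_{i=1}^{m} \sum_{j=1}^{n} o_{ij} \log\frac{o_{ij}}{e_{ij}}$
$G_{pooled} = 2 \sum_{j=1}^{n} \left(\sum_{i=1}^{m} o_{ij}\right) \log\frac{\sum_{i=1}^{m} o_{ij}}{\sum_{i=1}^{m} e_{ij}}$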
Multiple Populations: Example
In Mendel’s F2 cross of smooth and wrinkled
inbred pea lines, he sampled 10 plants and
counted the number of smooth and wrinkled
peas produced by each of those plants.
Is there heterogeneity between plants?
Further tests show that a single gene controls
smooth vs. wrinkled and that smooth is dominant
to wrinkled.
Screening Markers for
Polymorphism
An important step in designing mapping studies is
to find markers that show polymorphism. We are
interested in tests for polymorphism.
A false negative would result if the marker was
truly polymorphic, but our test showed it to be
monomorphic.
A false positive would result if the marker was truly
monomorphic, but our test showed it to be
polymorphic.
Testing for Polymorphism:
Backcross 1:1
You design a backcross experiment to test for
polymorphism at a marker of interest. You
sample n offspring of the backcross.
P(monomorphic) = $2(0.5)^n$
Testing for Polymorphism: F2
codominant 1:2:1
You design an F2 cross with a marker that is
codominant. You sample n F2 individuals.
P(monomorphic) = $2(0.25)^n + (0.5)^n$
Testing for Polymorphism: F2
dominant marker
You design an F2 cross, but this time observe
a dominant marker. You sample n F2
individuals.
P(monomorphic) = $(0.75)^n + (0.25)^n$
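All three formulas have the same form: the probability that all n offspring fall in a single class is the sum over classes of (class probability)^n. A minimal sketch, taking power as 1 - P(monomorphic); the design labels and function names are illustrative.

```python
def p_monomorphic(n: int, class_probs) -> float:
    """Probability that all n sampled offspring fall in the same class."""
    return sum(q ** n for q in class_probs)

def power(n: int, class_probs) -> float:
    """Probability of seeing at least two classes among n offspring."""
    return 1 - p_monomorphic(n, class_probs)

DESIGNS = {
    "backcross 1:1": (0.5, 0.5),
    "F2 codominant 1:2:1": (0.25, 0.5, 0.25),
    "F2 dominant 3:1": (0.75, 0.25),
}

for name, probs in DESIGNS.items():
    print(name, [round(power(n, probs), 3) for n in (3, 5, 10, 19)])
```

For a given sample size the codominant F2 has the highest power and the dominant F2 the lowest, because the 3:1 design has one class with probability 0.75.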
Power of Test for
Polymorphism
[Figure: Power to Detect Polymorphism. Power (y-axis) against sample size (x-axis, 1 to 19) for the 1:1, 1:2:1, and 3:1 segregation ratios.]
Estimating Heterozygosity
$H = 1 - \sum_{i=1}^{l} p_i^2$
$\hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{l} \hat{p}_i^2\right)$
$Var(\hat{H})$ = [expression in n, n - 1, $\sum_{i=1}^{l} p_i^3$, and $\sum_{i=1}^{l} p_i^2$; not recoverable from the extracted text]
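The heterozygosity H and its estimate with the n/(n - 1) correction can be computed as below. This is a minimal sketch; the allele frequencies and the sample size n = 50 are hypothetical values chosen for illustration, and n is simply passed in as whatever sample size the estimator is defined on.

```python
def heterozygosity(freqs) -> float:
    """H = 1 - sum of squared allele frequencies."""
    return 1 - sum(p * p for p in freqs)

def heterozygosity_estimate(freq_estimates, n: int) -> float:
    """H-hat = n / (n - 1) * (1 - sum of squared estimated frequencies)."""
    return n / (n - 1) * heterozygosity(freq_estimates)

freqs = (0.5, 0.3, 0.2)  # hypothetical allele frequencies
print(round(heterozygosity(freqs), 3), round(heterozygosity_estimate(freqs, 50), 3))
```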
Estimating Allele Frequency
It is often assumed that alleles have equal
frequencies when there are many alleles at a
locus. This assumption can result in false
positives for linkage, so it is important to test
allele frequencies.
Suppose there are l possible alleles A1, A2,
…, Al. You observe nij genotypes AiAj.
You estimate the genotype frequencies $\hat{p}_{ij}$.
Estimating Allele Frequencies
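$\hat{p}_i = \hat{p}_{ii} + \frac{1}{2} \sum_{j \ne i}^{l} \hat{p}_{ij}$
$Var(\hat{p}_i) = \frac{1}{2n}\left[p_i(1-p_i) - p_i^2 + p_{ii}\right] = \frac{p_i(1-p_i)}{2n}$ under HWE
$Cov(\hat{p}_i, \hat{p}_j) = \frac{1}{4n}\left(p_{ij} - 4 p_i p_j\right) = -\frac{1}{2n} p_i p_j$ under HWE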
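Allele frequencies follow from genotype counts by giving each homozygote full weight and each heterozygote half weight. A minimal sketch, assuming genotype frequencies are estimated as counts divided by the number of individuals n; the genotype counts below are hypothetical.

```python
def allele_frequencies(genotype_counts):
    """genotype_counts maps an unordered allele pair (a, b) to its count n_ab."""
    n = sum(genotype_counts.values())
    freqs = {}
    for (a, b), count in genotype_counts.items():
        p_geno = count / n  # assumed estimate of the genotype frequency
        if a == b:
            freqs[a] = freqs.get(a, 0.0) + p_geno
        else:
            freqs[a] = freqs.get(a, 0.0) + p_geno / 2
            freqs[b] = freqs.get(b, 0.0) + p_geno / 2
    return n, freqs

def hwe_variance(p: float, n: int) -> float:
    """Var(p-hat) = p(1 - p) / (2n), valid under Hardy-Weinberg equilibrium."""
    return p * (1 - p) / (2 * n)

# Hypothetical genotype counts for three alleles.
counts = {("A1", "A1"): 10, ("A1", "A2"): 20, ("A2", "A2"): 15, ("A1", "A3"): 5}
n, freqs = allele_frequencies(counts)
print(n, {a: round(p, 3) for a, p in freqs.items()})
```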
Probability of Observing an
Allele
Suppose there is an allele Ai with frequency
pi. What is the probability of sampling at
least one allele of type Ai?
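$P(\text{observing at least one allele } A_i) = 1 - (1 - p_i)^{2n}$
Sample size calculation:
$n = \frac{\log(1 - \beta_i)}{2 \log(1 - p_i)}$
where $\beta_i$ is the desired probability of observing at least one allele Ai.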
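The sample size calculation above can be sketched as follows. The target probability is called `beta` here (a name chosen for this sketch), and the result is rounded up to a whole number of individuals; the example values p = 0.05 and beta = 0.95 are hypothetical.

```python
from math import ceil, log

def prob_observe(p: float, n: int) -> float:
    """Probability of seeing at least one copy of an allele among 2n genes."""
    return 1 - (1 - p) ** (2 * n)

def sample_size(p: float, beta: float) -> int:
    """Smallest n with prob_observe(p, n) >= beta."""
    return ceil(log(1 - beta) / (2 * log(1 - p)))

n = sample_size(0.05, 0.95)
print(n, round(prob_observe(0.05, n), 3))
```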
Probability of Observing
Multiple Alleles
Let $\beta_i$ be the probability of observing at least one
allele of type i.
There are $\binom{l}{m}$ ways of selecting m different
alleles, and each selection $j_m$ has an associated probability ($\beta_{j_m}$) of
detecting at least one of each, calculated from the $\beta_i$.
Then we can calculate the probability of observing
k or more alleles by summing over these
probabilities for k, k+1, …, l.
Approximate Probability of
Observing k or More Alleles
The above procedure becomes computationally
difficult when there are many alleles and the
frequencies are unequal.
There is a Monte Carlo approximation.
Select a random variable $I_i$ to be 1 with probability
$\beta_i$ and 0 otherwise.
Compute $I = \sum_{i=1}^{l} I_i$ for b bootstrap trials. The
proportion of trials with $I \ge k$ is an estimate of the
probability of observing k or more alleles.
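The Monte Carlo procedure above can be sketched as below. The per-allele detection probabilities are passed in as `betas` (a name chosen for this sketch), and the indicators are drawn independently, which is how the procedure on the slide reads; that independence is an assumption. The example probabilities are hypothetical.

```python
import random

def prob_k_or_more(betas, k: int, trials: int = 10_000, seed: int = 0) -> float:
    """Estimate P(at least k alleles observed) from per-allele detection probabilities."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(trials):
        # One indicator per allele: 1 with probability beta, else 0.
        observed = sum(1 for beta in betas if rng.random() < beta)
        if observed >= k:
            hits += 1
    return hits / trials

betas = [0.99, 0.9, 0.6, 0.3]  # hypothetical detection probabilities
print(prob_k_or_more(betas, k=3))
```

Increasing `trials` reduces the Monte Carlo error, which shrinks roughly with the square root of the number of trials.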
Summary
Permutation and combinations: knowing how to
count number of genotypes, mating types, etc.
Testing segregation ratios for dominant and
codominant loci.
Testing for population heterogeneity.
Screening for polymorphism.
Estimating heterozygosity, probability of observing
an allele.