Statistical Considerations for
Population-Based Studies in Cancer I
Special Topic: Statistical analyses of
twin and family data
Kim-Anh Do, Ph.D.
Associate Professor
Department of Biostatistics
Email: [email protected]
http://odin.mdacc.tmc.edu/~kim
1
The usual idea of a gene is a specific region of DNA that
codes for a single protein or enzyme, and the position of
a gene on a chromosome is its locus.
• The basis for research by human geneticists is to try to
identify traits, or phenotypes, whose inheritance
patterns are consistent with the action of individual
genes.
• Recent advances in genetics show that the relationship
between DNA sequence and phenotype is both more
complex and more interesting than we thought.
• Some functions of DNA do not even depend on its
nucleotide sequence, and DNA sequence variation
includes a variety of direct and indirect forms of
feedback among various regions of the DNA within and
between cells.
2
Allele and genotype frequencies
• The most fundamental quantitative variable in
population genetics is the allele frequency, a
prevalence measure.
• When a locus has only two alleles, denote their
frequencies $p$ and $q = 1 - p$.
• Let $P_g$ denote the frequency of genotype $g$. Under random
mating:
• The frequency of an $ii$ homozygote is $P_{ii} = p_i p_i = p_i^2$.
• The frequency of an $ik$ heterozygote is $P_{ik} = 2 p_i p_k$.
• For a diallelic system the genotypes have frequencies
$P_{AA} = p^2$
$P_{Aa} = 2pq$
$P_{aa} = q^2$
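The genotype frequencies above can be sketched in a few lines of Python. This is only an illustration that assumes random mating; the function name `genotype_frequencies` and the value p = 0.7 are hypothetical and do not come from the lecture.

```python
def genotype_frequencies(allele_freqs):
    """Return {(i, k): frequency} for unordered genotypes under random mating.

    allele_freqs maps each allele label to its population frequency.
    """
    alleles = sorted(allele_freqs)
    freqs = {}
    for pos, i in enumerate(alleles):
        for k in alleles[pos:]:
            if i == k:
                # Homozygote: P_ii = p_i^2
                freqs[(i, k)] = allele_freqs[i] ** 2
            else:
                # Heterozygote: P_ik = 2 p_i p_k
                freqs[(i, k)] = 2 * allele_freqs[i] * allele_freqs[k]
    return freqs


p = 0.7  # illustrative allele frequency, not from the slides
freqs = genotype_frequencies({"A": p, "a": 1 - p})
for genotype, value in freqs.items():
    print(genotype, round(value, 4))
```

The same function works for more than two alleles, since the homozygote and heterozygote rules are stated per allele pair.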
3
Frequency relationships between
genotype and phenotype
The concept of penetrance
• A given genotype does not always produce the same
phenotype. The association between the two is known
as the penetrance.
• Individuals with a given genotype will have some
distribution of phenotypes; the penetrance function
specifies the probability that an individual with
genotype $g$ has phenotype $\phi$:
$f_g(\phi) = \Pr(\phi \mid g)$
4
Frequency relationships between
genotype and phenotype (cont’d)
• For many quantitative biological traits there is some
measurement scale on which the phenotypes are
approximately normally distributed.
$f_g(\phi) = \frac{1}{\sigma_g \sqrt{2\pi}} \exp\left[ -\frac{(\phi - \mu_g)^2}{2\sigma_g^2} \right]$
• Penetrance is a statistical, population-specific
association between genotype and phenotype, not a
biological explanation of such a relationship.
• Many factors may affect the expression of a given
genotype: genes, environmental factors, errors in
measurement or classification, sampling error etc.
5
Nuclear families and sibships
The distribution of traits in families
• A diploid, sexually reproducing organism has two
sets of genes, one inherited from each parent.
• Each time that individual produces his/her own
gamete (sperm or egg), one of his/her inherited alleles,
at each locus, will be randomly chosen and transmitted
in the gamete. There is thus a probability of ½ that an
offspring will inherit a specific parental allele. THIS
probabilistic aspect of inheritance IS A FUNDAMENTAL
ASPECT OF OUR BIOLOGY.
6
Segregation analysis: discrete traits in families
• We can understand the basic principles of genetic epidemiology
by studying the behavior of alleles at a single locus in nuclear
families.
• We can take advantage of evolution-based constraints on the
distribution of genetic variation in families.
• The analysis of trait distributions in families is known as
segregation analysis after Gregor Mendel’s Law of Segregation
of individual alleles at a locus.
• The idea is to judge if the pattern of phenotypes in families is
consistent with a genetic model.
• Families are ascertained via one or more index individuals, or
probands, who may be either randomly identified, or chosen
because of their disease or other phenotype status.
7
Nuclear families and sibships (cont’d)
Transmission probabilities
• For a single diallelic locus with alleles A and a, define
the transmission probability $t(x \mid g)$ as the probability
that a parent of genotype $g$ produces a gamete carrying
allele $x$. These are conditional probabilities because
they depend on the genotypic state of the parent.
• For autosomal loci
$t(A \mid AA) = 1$,
$t(A \mid Aa) = 1/2$,
$t(A \mid aa) = 0$.
9
Table 5.1A. Genotypic mating table for an
autosomal diallelic locus

             Empiric     Frequency under               Offspring genotype probabilities
Mating       mating      random mating                 Conditional       Unconditional
type         frequency                                 AA   Aa   aa      AA            Aa             aa
AA × AA      $M_{11}$    $p^2 \cdot p^2 = p^4$         1    0    0       $p^4$         0              0
AA × Aa      $M_{12}$    $2 p^2 (2pq) = 4 p^3 q$       ½    ½    0       $2 p^3 q$     $2 p^3 q$      0
AA × aa      $M_{13}$    $2 p^2 q^2$                   0    1    0       0             $2 p^2 q^2$    0
Aa × Aa      $M_{22}$    $(2pq)(2pq) = 4 p^2 q^2$      ¼    ½    ¼       $p^2 q^2$     $2 p^2 q^2$    $p^2 q^2$
Aa × aa      $M_{23}$    $2 (2pq) q^2 = 4 p q^3$       0    ½    ½       0             $2 p q^3$      $2 p q^3$
aa × aa      $M_{33}$    $q^2 \cdot q^2 = q^4$         0    0    1       0             0              $q^4$
10
Nuclear families and sibships (cont’d)
Mating types
• The probability that an individual has a given
genotype is determined by the genotypes, or mating
type, of its parents.
• A nuclear family is a set of repeated selections of
offspring genotypes from the mating type, $M_{kl}$, of
parents with genotypes $k$ and $l$.
• In a population (or sample),
$0 \le \Pr(M_{kl}) \le 1$; $\sum_{k,l} \Pr(M_{kl}) = 1$; $\widehat{\Pr}(M_{kl}) = n_{kl}/N$,
where $n_{kl}$ is the number of matings of type $M_{kl}$ observed among $N$ matings.
• If there is random mating relative to the locus in
question, the mating type frequencies are determined
by the genotype frequencies (which are in turn determined by the allele
frequencies).
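As a rough sketch of the last point, the random-mating frequencies of the six mating types can be computed directly from the allele frequency. The function names and the choice of unordered mating types (AA × Aa counted together with Aa × AA) are assumptions made for this illustration.

```python
from itertools import combinations_with_replacement

GENOTYPES = ("AA", "Aa", "aa")


def genotype_freqs(p):
    """Genotype frequencies for allele frequency p of A, assuming random mating."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def mating_type_freqs(p):
    """Random-mating frequency of each unordered mating type."""
    g = genotype_freqs(p)
    table = {}
    for k, l in combinations_with_replacement(GENOTYPES, 2):
        # Unlike genotypes can pair in two orders (father k, mother l, or the
        # reverse), so their product is doubled.
        table[(k, l)] = g[k] * g[l] * (1 if k == l else 2)
    return table


for mating, freq in mating_type_freqs(0.3).items():
    print(mating, round(freq, 4))
```

The six frequencies sum to one because they are the terms of $(p + q)^4$ grouped by mating type.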
11
Nuclear families and sibships (cont’d)
Transition probabilities
• Family data consist of parent-offspring triads.
• Define transition probabilities $P(g_o \mid g_f, g_m)$ as the
conditional probabilities of genotypes in offspring
given those in the father and mother.
• For a diallelic locus, there are three possible
offspring genotypes (AA, Aa, aa), with transition
probabilities
$P(AA \mid g_f, g_m) = t(A \mid g_f)\, t(A \mid g_m)$
$P(Aa \mid g_f, g_m) = t(A \mid g_f)\,[1 - t(A \mid g_m)] + t(A \mid g_m)\,[1 - t(A \mid g_f)]$
$P(aa \mid g_f, g_m) = [1 - t(A \mid g_m)]\,[1 - t(A \mid g_f)]$
See Table 5.1B.
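The three transition probabilities can be checked with a short sketch that uses exact fractions. The names `T_A` and `transition` are hypothetical; the only inputs are the autosomal transmission probabilities 1, ½ and 0 for parental genotypes AA, Aa and aa.

```python
from fractions import Fraction

# Probability that a parent of each genotype transmits allele A (autosomal locus).
T_A = {"AA": Fraction(1), "Aa": Fraction(1, 2), "aa": Fraction(0)}


def transition(father, mother):
    """Return (P(AA), P(Aa), P(aa)) for an offspring of the given parents."""
    tf, tm = T_A[father], T_A[mother]
    p_AA = tf * tm
    p_Aa = tf * (1 - tm) + tm * (1 - tf)
    p_aa = (1 - tf) * (1 - tm)
    return (p_AA, p_Aa, p_aa)


for father in T_A:
    for mother in T_A:
        probs = transition(father, mother)
        print(father, "x", mother, [str(x) for x in probs])
```

Each triple sums to one, and the nine triples are the entries of the parent-to-offspring transition table for a diallelic locus.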
12
Table 5.1B. Parent to offspring transition
probabilities for a diallelic locus
Each entry lists the offspring probabilities {AA Aa aa}.

Mother’s       Father’s genotype
genotype       AA           Aa           aa
AA             {1 0 0}      {½ ½ 0}      {0 1 0}
Aa             {½ ½ 0}      {¼ ½ ¼}      {0 ½ ½}
aa             {0 1 0}      {0 ½ ½}      {0 0 1}
13
Table 5.2. Phenotypic mating table for an autosomal diallelic locus

               Random mating         Offspring segregation proportions ($\theta$)
Mating type    frequency             Conditional                            Unconditional
                                     D                    R                 D                R
Dominant by dominant matings
AA × AA        $p^4$                 1                    0                 $p^4$            0
AA × Aa        $4 p^3 q$             1                    0                 $4 p^3 q$        0
Aa × Aa        $4 p^2 q^2$           ¾                    ¼                 $3 p^2 q^2$      $p^2 q^2$
All D × D      $(1 - q^2)^2$         $(1+2q)/(1+q)^2$     $q^2/(1+q)^2$     $p^2 (1+2q)$     $p^2 q^2$
Dominant by recessive matings
AA × aa        $2 p^2 q^2$           1                    0                 $2 p^2 q^2$      0
Aa × aa        $4 p q^3$             ½                    ½                 $2 p q^3$        $2 p q^3$
All D × R      $2 q^2 (1 - q^2)$     $1/(1+q)$            $q/(1+q)$         $2 p q^2$        $2 p q^3$
Recessive by recessive matings
aa × aa        $q^4$                 0                    1                 0                $q^4$
All R × R      $q^4$                 0                    1                 0                $q^4$

D = dominant, R = recessive. “All {mating phenotype}” rows are weighted by their population frequencies.
The segregation proportion, $\theta$, can be interpreted as the probability that a random offspring is affected.
14
Segregation analysis: discrete traits in families (cont’d)
Ascertainment bias and correction: sibship data
• The way in which families are ascertained can have a major effect
on the interpretation we make of the data.
Example: Ascertain affected children through the school system.
Collect data on all siblings of the affected children.
Suppose the segregation proportion (also the probability that a random
offspring is affected) is $\theta$. The number of affected children, $r$, in a
sibship of size $s$ follows a binomial distribution:
$\Pr(r \mid s, \theta) = \frac{s!}{r!\,(s-r)!}\, \theta^r (1-\theta)^{s-r}$
Therefore the probability that such a family will produce $s$ normal
children is $(1-\theta)^s$.
These families will never be identified if we ascertain
sibships through affected school children.
15
Ascertainment bias and correction: sibship data (cont’d)
• We must correct for ascertainment to obtain unbiased estimates.
• One simple way: recognize that our sample contains all families
except those with no affected children, i.e. our sample represents a
fraction $1 - (1-\theta)^s$ of the total population of sibships in this
example, where $\theta$ is the segregation proportion.
• The corrected probability of $r$ affected children in a family of size $s$ is
$\Pr(r \mid s, \theta) = \frac{s!}{r!\,(s-r)!}\, \theta^r (1-\theta)^{s-r} \Big/ \left[ 1 - (1-\theta)^s \right]$, for $r \ge 1$.
• Another way of ascertainment correction is to perform analyses
ignoring the affected probands. This is acceptable only if the
probability that a given affected child is ascertained is small.
• Other ascertainment problems: Families with many affecteds
may have a higher chance of being ascertained by a given
sampling scheme.
• Corrections for some simple sampling situations have long
been known in medical genetics, but methods for complex
situations are still inexact.
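The simple correction for families with no affected children amounts to a zero-truncated binomial. The sketch below is a minimal illustration, assuming every sibship with at least one affected child is ascertained; the function names and the example values (sibship size 3, segregation proportion 0.25) are hypothetical.

```python
from math import comb


def binom_pmf(r, s, theta):
    """Probability of r affected children in a sibship of size s."""
    return comb(s, r) * theta**r * (1 - theta) ** (s - r)


def truncated_pmf(r, s, theta):
    """Same probability, conditional on at least one affected child."""
    if r < 1:
        return 0.0
    return binom_pmf(r, s, theta) / (1 - (1 - theta) ** s)


def expected_affected_fraction(s, theta):
    """Expected fraction of affected children among ascertained sibships."""
    mean_r = sum(r * truncated_pmf(r, s, theta) for r in range(1, s + 1))
    return mean_r / s


s, theta = 3, 0.25  # illustrative values only
print(sum(truncated_pmf(r, s, theta) for r in range(1, s + 1)))
print(expected_affected_fraction(s, theta))
```

The second printed value exceeds the true segregation proportion, which is the ascertainment bias: the naive proportion of affected children in ascertained sibships overestimates it unless the truncation is modelled.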
16
Segregation analysis: quantitative traits in families
• Quantitative traits may be affected by a large number
of loci acting together, as well as by environmental
factors.
• Examples of important disease related traits:
Blood pressure; obesity measures;
cholesterol; triglycerides.
• We need to understand the effect of the genotypes,
and the environment, on the phenotype.
• The effects of genotypes on quantitative phenotypes
are relative: does genotype AA increase the
phenotype, or does aa decrease it?
17
Segregation analysis: quantitative traits in families (cont’d)
• The simplest measure of genetic effect is the genotypic
value, the mean phenotype observed among
individuals with a given genotype in the reference
population:
$\mu_g = \sum_i \phi_i\, f_g(\phi_i)$
where $f_g(\phi_i)$ is the penetrance, the probability of phenotype $\phi_i$ given genotype $g$.
• The mean number of doses of a given allele, say A, in genotypes in a
population is
$\bar{\mu}_g = 2\,(p^2) + 1\,(2pq) + 0\,(q^2) = 2p$
• The mean phenotype in the population is the weighted
average
$\mu = \sum_g P_g\, \mu_g$
$= p^2 \mu_{AA} + 2pq\, \mu_{Aa} + q^2 \mu_{aa}$
for a diallelic locus.
18
Genetic variation for a quantitative trait
• The genotypic variance is defined as the variance
among the genotypic values in the population:
$\sigma_g^2 = \sum_g P_g (\mu_g - \mu)^2 = \sum_g P_g\, \mu_g^2 - \mu^2$
$= 2pq$ when the genotypic values are the allele doses (2, 1, 0).
• It is often convenient to express genotypic values as
deviations from the population mean, denoted by
$\delta_g = \mu_g - \mu$
• In the simplest situation, the effects of the individual
alleles are additive, and the genotypic value is the sum
of the effects of the two alleles in the genotype.
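A quick numerical check of the population mean and the genotypic variance for a diallelic locus might look like the sketch below. The function names are hypothetical, random mating is assumed, and the genotypic values used are simply the allele doses 2, 1 and 0.

```python
def genotype_freqs(p):
    """Genotype frequencies under random mating for allele frequency p of A."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def population_mean(p, mu):
    """Weighted average of the genotypic values mu[g]."""
    P = genotype_freqs(p)
    return sum(P[g] * mu[g] for g in P)


def genotypic_variance(p, mu):
    """Variance of the genotypic values around the population mean."""
    P = genotype_freqs(p)
    mean = population_mean(p, mu)
    return sum(P[g] * (mu[g] - mean) ** 2 for g in P)


dose = {"AA": 2, "Aa": 1, "aa": 0}  # genotypic value = number of A alleles
p = 0.3                             # illustrative allele frequency
print(population_mean(p, dose))     # close to 2p
print(genotypic_variance(p, dose))  # close to 2pq
```

Replacing `dose` with any other set of genotypic values gives the corresponding mean and variance for that trait.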
19
Genetic variation for a quantitative trait (cont’d)
• Define $\alpha_i$ to be the allelic value that allele $i$
contributes to the genotype. Since allele A is paired with
another A a fraction $p$ of the time, and with a a fraction $q$ of the
time, we have
$\alpha_A = p\,\delta_{AA} + q\,\delta_{Aa}$
$\alpha_a = p\,\delta_{Aa} + q\,\delta_{aa}$
where $\delta_g$ is the deviation of the genotypic value of $g$ from the population mean.
• Special characteristic of effects expressed as deviations:
their average over all genotypes must be zero, i.e.
$p\,\alpha_A + q\,\alpha_a = 0$.
• When the allelic effects are additive, the breeding value,
or average deviation, of genotype $ik$ is $\alpha_i + \alpha_k$.
20
Genetic variation for a quantitative trait (cont’d)
• Define the additive genotypic variance, $\sigma_A^2$, as the sum
of squares of the breeding values, weighted by the
genotype frequencies:
$\sigma_A^2 = p^2 (2\alpha_A)^2 + 2pq\,(\alpha_A + \alpha_a)^2 + q^2 (2\alpha_a)^2$
$= 2\,(p\,\alpha_A^2 + q\,\alpha_a^2)$
• Define the dominance displacement $d$ as the position of
the heterozygote relative to the two homozygotes:
$d = (\mu_{Aa} - \mu_{aa}) / (\mu_{AA} - \mu_{aa})$
• If the effects are purely additive, the heterozygote
genotypic value will be exactly halfway between those of
the homozygotes, i.e. $d = 1/2$.
• The dominance variance is the variance due to
dominance deviations from additivity and equals
$\sigma_D^2 = p^2 (\delta_{AA} - 2\alpha_A)^2 + 2pq\,(\delta_{Aa} - \alpha_A - \alpha_a)^2 + q^2 (\delta_{aa} - 2\alpha_a)^2$
where $\delta_g$ is the deviation of the genotypic value of $g$ from the population mean.
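The additive and dominance variances can be computed together in a short sketch. It assumes random mating and takes the allelic values to be frequency-weighted averages of the genotypic deviations; the function name and the genotypic values in the example are hypothetical.

```python
def variance_components(p, mu):
    """Return (genotypic, additive, dominance) variances for a diallelic locus.

    p is the frequency of allele A; mu maps 'AA', 'Aa', 'aa' to genotypic values.
    """
    q = 1 - p
    P = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}
    mean = sum(P[g] * mu[g] for g in P)
    dev = {g: mu[g] - mean for g in P}  # deviations from the population mean

    # Allelic values: average deviation of the genotypes each allele occurs in.
    alpha_A = p * dev["AA"] + q * dev["Aa"]
    alpha_a = p * dev["Aa"] + q * dev["aa"]
    breeding = {"AA": 2 * alpha_A, "Aa": alpha_A + alpha_a, "aa": 2 * alpha_a}

    var_g = sum(P[g] * dev[g] ** 2 for g in P)
    var_a = sum(P[g] * breeding[g] ** 2 for g in P)
    var_d = sum(P[g] * (dev[g] - breeding[g]) ** 2 for g in P)
    return var_g, var_a, var_d


# Illustrative values: heterozygote closer to AA, so some dominance is present.
var_g, var_a, var_d = variance_components(0.4, {"AA": 2.0, "Aa": 1.5, "aa": 0.0})
print(var_g, var_a, var_d)
```

With these definitions the additive and dominance parts add up to the genotypic variance, and the dominance variance vanishes when the heterozygote lies exactly halfway between the homozygotes.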
21
Environmental effects on quantitative phenotypes
• Environmental factors are responsible for within-genotype
variance. The simplest way to account for environmental variance is
to aggregate all unmeasured effects on the phenotype, usually
assuming that they have a normal distribution.
• We can now express the determination of the
phenotype as a sum of additive genetic, dominance, and
environmental effects
$\phi = A + D + E$
with variance
$\sigma^2 = \sigma_A^2 + \sigma_D^2 + \sigma_E^2$
• The environmental effects can be additive, i.e. act
similarly on each genotype, or there can be a genotype by
environment (G × E) interaction if the same
environmental exposure affects different genotypes
differently (add $\sigma_{GE}^2$ to the above equation).
22
Kinship and inbreeding coefficients: probabilities of
shared genes
Several quantities are used to measure the genetic
relationship between two individuals.
• The coefficient of kinship, $F_{XY}$, between individuals X and
Y is the probability that two alleles at the same locus, one
chosen randomly from each individual, are identical by
descent (ibd) from some common ancestor.
• The inbreeding coefficient, $F$, of an individual is the probability that
his/her two alleles at a locus are ibd. This equals the
kinship coefficient of his/her parents.
• The coefficient of relationship, $r = 2F_{XY}$, is the fraction of
genes shared ibd by two individuals.
• Table 6.2 gives kinship coefficients $F$ for various
important kinds of relative pairs.
23
Table 6.2 (Weiss) Genetic relationships among various
types of relative

                               Degree of        Coefficient of
Relative type                  relationship     Kinship (F)    Relationship (r)
MZ twins                       -                -              1
Parent-offspring               1st              ¼              ½
Full sibs / DZ twins           1st              ¼              ½
Half-siblings                  2nd              1/8            ¼
Avuncles*                      2nd              1/8            ¼
Half-avuncles*                 3rd              1/16           1/8
First cousins                  3rd              1/16           1/8
Double first cousins           2nd              1/8            ¼
Half first cousins             4th              1/32           1/16
1st cousins once removed       4th              1/32           1/16
Second cousins                 5th              1/64           1/32

* Avuncles refers to uncle/aunt-nephew/niece pairs
24
Genotypic correlation between relatives
Consider the genotypic values of parents and offspring, for
an additive diallelic locus. See Table 6.3.
• For a locus with three genotypes, there are nine possible
parent-offspring genotype pairs.
Example: First row of table.
• The probabilities that an AA father has an AA, Aa, or aa child
are $p$, $1-p$, and 0 respectively, because:
All offspring receive an A from the father with
probability 1, so they cannot have genotype aa.
They receive an A from the
mother with probability $p$ (making their genotype AA), or an a from
the mother with probability $1-p$ (making their genotype Aa).
25
Table 6.3. Parent-offspring relationships

Parent                                 Offspring genotype probabilities
Genotype    Prob          Dose value   AA (2)    Aa (1)    aa (0)        Total    Mean dose
AA          $p^2$         2            $p$       $1-p$     0             1.0      $p + 1$
Aa          $2p(1-p)$     1            $p/2$     ½         $(1-p)/2$     1.0      $p + 1/2$
aa          $(1-p)^2$     0            0         $p$       $1-p$         1.0      $p$

From this table, the covariance between parent
(P) and offspring (O) can be calculated from all
the values in the table to arrive at
$\mathrm{Cov}(P, O) = p(1-p) = \frac{1}{2}\sigma_g^2$
Recall:
$\sigma_g^2 = 2pq$; and
$\bar{\mu}_g = 2p$
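The covariance can be verified by summing over the nine parent-offspring genotype pairs. The sketch below uses exact fractions and assumes random mating, with the other parent drawn at random from the population; the function name and the test value p = 3/10 are illustrative.

```python
from fractions import Fraction


def parent_offspring_cov(p):
    """Covariance of allele dose between one parent and an offspring."""
    q = 1 - p
    # Probability of each parental dose (2 = AA, 1 = Aa, 0 = aa).
    parent_prob = {2: p * p, 1: 2 * p * q, 0: q * q}
    # Offspring dose probabilities given the parental dose.
    offspring_given_parent = {
        2: {2: p, 1: q, 0: 0},
        1: {2: p / 2, 1: Fraction(1, 2), 0: q / 2},
        0: {2: 0, 1: p, 0: q},
    }
    e_p = sum(x * pr for x, pr in parent_prob.items())
    e_o = 0
    e_po = 0
    for x, pr in parent_prob.items():
        for y, cond in offspring_given_parent[x].items():
            e_o += pr * cond * y
            e_po += pr * cond * x * y
    return e_po - e_p * e_o


p = Fraction(3, 10)
print(parent_offspring_cov(p), p * (1 - p))
```

Both printed values agree, which matches the statement that the parent-offspring covariance is half the genotypic variance $2pq$.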
26
Table 6.4 (Weiss) Components of genetic covariance for
various types of relative

                            Coefficient of
Relative type               $\sigma_A^2$    $\sigma_D^2$
MZ twins                    1               1
Full sibs / DZ twins        ½               ¼
Parent-offspring            ½               0
Mid-parent-offspring        ½               0
Half-siblings               ¼               0
Avuncles*                   ¼               0
Double first cousins        ¼               1/16
First cousins               1/8             0
General                     $r$             $u$

* Avuncles refers to uncle/aunt-nephew/niece pairs
27
The covariance between any pair of relatives, P and Q, can
be expressed as a weighted combination of the additive and
dominance variances.
• Let the parents of P be denoted by A and B.
• Let the parents of Q be denoted by C and D.
$\mathrm{Cov}(P, Q) = r_{PQ}\,\sigma_A^2 + u_{PQ}\,\sigma_D^2$
where
$u_{PQ} = F_{AC} F_{BD} + F_{AD} F_{BC}$
The $F$ values are kinship coefficients given in Table 6.2.
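As a hedged worked example, the dominance coefficient can be evaluated for two familiar cases. The function names are hypothetical. One assumption goes beyond the table: when two relatives share the same parent (as full sibs do), the kinship of that non-inbred parent with itself is taken to be ½.

```python
from fractions import Fraction


def dominance_coefficient(f_ac, f_bd, f_ad, f_bc):
    """u_PQ from the kinship coefficients between the parents of P and Q."""
    return f_ac * f_bd + f_ad * f_bc


def relative_covariance(r, u, var_a, var_d):
    """Cov(P, Q) as a weighted sum of additive and dominance variances."""
    return r * var_a + u * var_d


half, quarter, zero = Fraction(1, 2), Fraction(1, 4), Fraction(0)

# Full sibs: A is the same person as C, and B the same as D (assumed
# self-kinship of 1/2); the two parents are unrelated to each other.
u_full_sibs = dominance_coefficient(half, half, zero, zero)

# Double first cousins: A and C are full sibs, B and D are full sibs
# (kinship 1/4 each), with no other relationships among the parents.
u_double_cousins = dominance_coefficient(quarter, quarter, zero, zero)

print(u_full_sibs, u_double_cousins)
```

The two results, ¼ and 1/16, are the dominance coefficients usually quoted for full sibs and double first cousins.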
28
Extension to multiple loci: polygenic traits
Fisher, 1918, showed that the single-locus genetic relationships among
relatives were preserved for multiple additive loci.
Example:
• At a single locus, there are 3 genotypes (AA, Aa, aa) and three
genotypic dose values (0, 1, and 2).
• At two such loci, there are nine genotypes (aabb, aabB, aaBB,
aAbb,aAbB, aABB, AAbb, AAbB, AABB) and 5 different genotypic values
(0, 1, 2, 3, 4).
• In general, for $n$ such loci there are $3^n$ genotypes and $2n+1$ genotypic
values, i.e., as $n$ gets large, the distribution of additive genotypic values
resembles the continuous distribution of a quantitative trait.
In practice, the distribution of summed additive effects can be
approximated by a normal distribution. The genotypic correlations
between relatives also hold for multiple additive loci.
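The way the dose distribution fills in as loci are added can be illustrated with a small convolution. This sketch assumes independent loci, random mating, and the same allele frequency at every locus; these simplifications and the function name are choices made for the example, not claims from the lecture.

```python
def dose_distribution(n, p):
    """Probability of each total dose (0..2n) summed over n additive loci."""
    q = 1 - p
    single = [q * q, 2 * p * q, p * p]  # doses 0, 1, 2 at one locus
    dist = [1.0]                        # zero loci: total dose 0 with certainty
    for _ in range(n):
        new = [0.0] * (len(dist) + 2)
        for i, a in enumerate(dist):
            for j, b in enumerate(single):
                new[i + j] += a * b     # add this locus's dose to the running total
        dist = new
    return dist


for n in (1, 2, 10):
    dist = dose_distribution(n, 0.5)
    print(n, 3**n, len(dist))  # loci, number of genotypes, number of dose values
```

For each n the number of distinct dose values is 2n + 1, and plotting `dist` for larger n shows the bell shape that motivates the normal approximation.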
29
Extension to multiple loci: polygenic traits (cont’d)
• Dominance refers to non-additive (interaction) effects between alleles
at the same locus.
• Epistasis refers to interactions among alleles at different loci. This
adds another term to the expression for the determination of the
phenotype
$\phi = PG + E = A + D + I + E$
with variance
$\sigma^2 = \sigma_{PG}^2 + \sigma_E^2 = \sigma_A^2 + \sigma_D^2 + \sigma_I^2 + \sigma_E^2$
which can be rewritten as
$1 = \sigma_{PG}^2 / \sigma^2 + \sigma_E^2 / \sigma^2$
• Define heritability as
$h^2 = \sigma_{PG}^2 / \sigma^2$
Heritability represents the ratio
of the observed phenotypic correlation to the theoretical genotypic
correlation.
• In twins:
$h^2 = (\sigma_{DZ}^2 - \sigma_{MZ}^2) / \sigma_{DZ}^2$
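The twin formula is simple enough to sketch directly. The example assumes that the two quantities in the formula are the within-pair phenotypic variances of DZ and MZ twin pairs; that reading is an interpretation, and the function name and numbers are hypothetical.

```python
def twin_heritability(var_dz, var_mz):
    """Heritability estimate from DZ and MZ variances (assumed within-pair)."""
    if var_dz <= 0:
        raise ValueError("DZ variance must be positive")
    return (var_dz - var_mz) / var_dz


# Illustrative variances only, not data from any study.
print(twin_heritability(4.0, 1.0))
```

If MZ pairs vary as much as DZ pairs the estimate is zero, and it approaches one as the MZ within-pair variance shrinks toward zero.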
30