Introduction to
some basic concepts
in quantitative genetics
Harald H.H. Göring
Course “Study Design and Data Analysis for Genetic Studies”, Universidad del Zulia,
Maracaibo, Venezuela, 6 April 2005
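“Nature vs. nurture”
[Figure: a trait is influenced by both genes and environment. Traits lie on a spectrum from 100% genetic / 0% environmental contribution (“mendelian” traits), through “complex” traits, to 0% genetic / 100% environmental contribution (infections, accidental injuries).]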
“Marker” loci
There are many different types of polymorphisms, e.g.:
• single nucleotide polymorphism (SNP):
AAACATAGACCGGTT
AAACATAGCCCGGTT
• microsatellite/variable number of tandem repeat (VNTR):
AAACATAGCACACA----CCGGTT
AAACATAGCACACACACCGGTT
• insertion/deletion (indel):
AAACATAGACCACCGGTT
AAACATAG--------CCGGTT
• restriction fragment length polymorphism (RFLP)
…
Genetic variation in numbers
There are ~6 x 10^9 humans on earth, and thus ~12 x 10^9 copies of
each autosomal chromosome. Assuming a mutation rate of ~1 x 10^-8
per nucleotide per generation, every single nucleotide will be mutated
(~12 x 10^9) x (~1 x 10^-8) = ~120 times in each new generation of
earthlings. Thus, every nucleotide will be polymorphic in Homo sapiens,
except for those where variation is incompatible with life.
Any 2 chromosomes differ from each other every ~1,000 bp. The 2
chromosomal sets inherited from the mother and the father (each
with a length of ~3 x 10^9 bp) therefore differ from each other at
~3 x 10^9 / ~1,000 = ~3 x 10^6, or ~3 million, locations.
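The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope calculation; the constant names are made up for illustration, and the unit of the mutation rate (per nucleotide, per chromosome copy, per generation) is an assumption about what the slide intends.

```python
# Rough figures taken from the slide; all are order-of-magnitude approximations.
POPULATION = 6e9            # humans on earth (the slide's figure)
MUTATION_RATE = 1e-8        # assumed unit: per nucleotide, per chromosome copy, per generation
GENOME_LENGTH_BP = 3e9      # length of one chromosomal set in base pairs
BP_PER_DIFFERENCE = 1_000   # two chromosomes differ about once every 1,000 bp

chromosome_copies = 2 * POPULATION                       # two autosomal copies per person
new_mutations_per_site = chromosome_copies * MUTATION_RATE
differences_between_sets = GENOME_LENGTH_BP / BP_PER_DIFFERENCE

print(round(new_mutations_per_site))     # new mutations at any one nucleotide per generation
print(round(differences_between_sets))   # differences between maternal and paternal sets
```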
Definitions of some important terms
locus:
a position in the DNA sequence, defined relative to others; in
different contexts, this might mean a specific polymorphism or
a very large region of DNA sequence in which a gene might be
located
gene:
the sum total of the DNA sequence in a given region related to
transcription of a given RNA, including introns, exons, and
regulatory regions
polymorphism:
the existence of 2 or more variants of some locus
allele:
the variant forms of either a gene or a polymorphism
neutral allele:
any allele which has no effect on reproductive fitness; a neutral
allele could affect a phenotype, as long as the phenotype itself has
no effect on fitness
silent allele:
any allele which has no effect on the phenotype under study; a
silent allele can affect other phenotype(s) and reproductive fitness
disease-predisposing allele:
any allele which increases susceptibility to a given disease; this
should not be called a mutation
mutation:
the process by which the DNA sequence is altered, resulting in a
different allele
Genetics vs. epidemiology:
aggregate effects
• The sharing of environmental factors among related (as well as
unrelated) individuals is hard to quantify as an aggregate.
• In contrast, the sharing of genetic factors among related (as well as
unrelated) individuals is easy to quantify, because inheritance of
genetic material follows very simple rules.
• Aggregate sharing of genetic material can therefore be predicted fairly
accurately without measurements, e.g.:
– a parent and his/her child share exactly 50% of their genetic material
(autosomal DNA)
– siblings share on average 50% of their genetic material
– a grandparent and his/her grandchild (or half-sibs or avuncular individuals)
share on average 25% of their genetic material
• genome as aggregate “exposure”: While it is not clear whether an
individual has been “exposed” to good or bad factors, “co-exposure”
among relatives is predictable.
Use of genetic similarity of relatives
• The genetic similarity of relatives, a result of inheritance of
copies of the same DNA from a common ancestor, is the basis for
– heritability analysis
– segregation analysis
– linkage analysis
– linkage disequilibrium analysis
– relationship inference
  • between close relatives (e.g., identification of human remains, paternity
  disputes)
  • between distant groups of individuals from the same species (e.g.,
  analysis of migration pattern)
  • between different species (e.g., analysis of phylogenetic trees)
– identification of conserved DNA sequences through sequence
alignment
– …
Relatives are not i.i.d.
• Unlike many random variables in many areas of
statistics, the phenotypes and genotypes of related
individuals are not independent and identically
distributed (i.i.d.).
• Many standard statistical tests therefore cannot and/or should
not be applied in the analysis of relatives.
• Most analyses on related individuals use likelihood-based
statistical approaches, due to the modeling
flexibility of this very general statistical framework.
“Mendelian” vs. “complex” traits
“simple mendelian” disease
• genotypes of a single locus cause disease
• often little genetic (locus) heterogeneity (sometimes even little allelic heterogeneity); little interaction between genotypes at different genes
• often hardly any environmental effects
• often low prevalence
• often early onset
• often clear mode of inheritance
• “good” pedigrees for gene mapping can often be found
• often straightforward to map
“complex multifactorial” disease
• genotypes of a single locus merely increase risk of disease
• genotypes of many different genes (and various environmental factors) jointly and often interactively determine the disease status
• important environmental factors
• often high prevalence
• often late onset
• no clear mode of inheritance
• not easy to find “good” pedigrees for gene mapping
• difficult to map
Genetic heterogeneity
[Figure: four scenarios shown along a time axis: locus homogeneity with allelic homogeneity; locus homogeneity with allelic heterogeneity; locus heterogeneity with allelic homogeneity (at each locus); locus heterogeneity with allelic heterogeneity (at each locus).]
Study design
different traits
different study designs
different analytical methods
How to simplify the etiological
architecture?
• choose tractable trait
– Are there sub-phenotypes within trait?
  • age of onset
  • severity
  • combination of symptoms (syndrome)
– “endophenotype” or “biomarker” vs. disease
  • quantitative vs. qualitative (discrete)
  • Dichotomizing quantitative phenotypes leads to loss of information.
  • simple/cheap measurement vs. uncertain/expensive diagnosis
  • not as clinically relevant, but with simpler etiology
• given trait, choose appropriate study design/ascertainment protocol
– study population
  • genetic heterogeneity
  • environmental heterogeneity
– “random” ascertainment vs. ascertainment based on phenotype of interest
  • single or multiple probands
  • concordant or discordant probands
  • pedigrees with apparent “mendelian” inheritance?
  • inbred pedigrees?
– data structures
  • singletons, small pedigrees, large pedigrees
– account for/stratify by known genetic and environmental risk factors
Qualitative and quantitative traits
• qualitative or discrete traits:
– disease (often dichotomous; assessed by
diagnosis): Huntington’s disease, obesity,
hypertension, …
– serological status (seropositive or seronegative)
– Drosophila melanogaster bristle number
• quantitative or continuous traits:
– height, weight, body mass index, blood pressure,
…
– assessed by measurement
[Figure: distribution of a discrete trait (e.g. hypertension, coded 0 or 1) shown next to the distribution of a continuous trait (e.g. blood pressure).]
Why use a quantitative trait?
Why not?
Pros and cons of
disease vs. quantitative trait
disease
• for rare disease, limited variation in random sample; need for non-random ascertainment
• for late-onset diseases, it is difficult/impossible to find multigenerational pedigrees
• diagnosis: often difficult, subjective, arbitrary
• treatment may cure disease or weaken symptoms, but original disease status is generally still known
• of great clinical interest
• often more complex etiologically
continuous trait
• sufficient variation in random sample; non-random ascertainment may not be necessary or advisable
• as no special ascertainment is necessary, any pedigree is suitable
• measurement: often straightforward, reliable
• medications and other covariates may influence phenotype
• often only of limited/indirect clinical interest
• often simpler etiologically
Dichotomizing quantitative phenotypes
generally leads to loss of information
[Figure: probability density of a quantitative phenotype, with the phenotype axis divided into an “unaffected” and an “affected” region.]
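Characterization of a quantitative trait
• center of distribution: mean: $\hat{\mu} = E[X] = \frac{\sum_i X_i}{n}$
• spread around center: variance: $\hat{\sigma}^2 = E[(X - E[X])^2] = \frac{\sum_i (X_i - \hat{\mu})^2}{n}$
• symmetry: skewness: $E[(X - E[X])^3] = \frac{\sum_i (X_i - \hat{\mu})^3}{n}$
• thickness of tails: kurtosis: $E[(X - E[X])^4] = \frac{\sum_i (X_i - \hat{\mu})^4}{n}$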
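The four summaries above are central moments and are easy to compute directly. The sketch below follows the slide's definitions (dividing by n, no standardization); the function name and the example data are invented for illustration.

```python
def central_moments(values):
    """Return the mean and the 2nd, 3rd and 4th central moments (divisor n)."""
    n = len(values)
    mean = sum(values) / n

    def moment(k):
        # k-th central moment: average of (X_i - mean)^k
        return sum((x - mean) ** k for x in values) / n

    return {
        "mean": mean,
        "variance": moment(2),
        "third_moment": moment(3),   # sign indicates direction of asymmetry
        "fourth_moment": moment(4),  # large values indicate heavy tails
    }


# Hypothetical trait values with one large observation (right-skewed).
trait = [1.0, 2.0, 3.0, 4.0, 10.0]
summary = central_moments(trait)
print(summary)
```

Many software packages report standardized versions instead, dividing the third and fourth moments by the 1.5th and 2nd power of the variance, so values from different tools are not always directly comparable.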
How can a continuous trait result from
discrete genetic variation?
Suppose 4 genes influence the trait, each with 2 equally
frequent alleles. Assume that at each locus allele 1
decreases the phenotype of an individual by 1, and that
allele 2 increases the phenotype by 1.
Now, let us obtain a random sample from the population
by coin tossing. Take 2 coins and toss them. 2 tails
mean genotype 11, and phenotype contribution of -2. 2 heads mean
genotype 22, and phenotype contribution of +2. 1 head
and 1 tail is a heterozygote (genotype 12), with
phenotype contribution of 0. Repeat this experiment 4 times (once
for each locus). Sum up the results to obtain the overall
phenotype.
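The coin-tossing experiment can also be simulated. The following is a minimal sketch, assuming fair coins (equally frequent alleles) and independent loci as on the slide; the function name, seed, and sample size are arbitrary choices made for illustration.

```python
import random
from collections import Counter


def simulate_phenotype(rng, n_loci=4):
    """Toss 2 coins per locus: heads = allele 2 (+1), tails = allele 1 (-1)."""
    total = 0
    for _ in range(n_loci):
        for _ in range(2):  # two alleles per locus
            total += 1 if rng.random() < 0.5 else -1
    return total


rng = random.Random(2005)  # fixed seed so the run is reproducible
sample = [simulate_phenotype(rng) for _ in range(10_000)]

# Tabulate the phenotype distribution: already roughly bell-shaped with 4 loci.
counts = Counter(sample)
for value in sorted(counts):
    print(f"{value:+d}\t{counts[value]}")
```

Even with only 4 loci the histogram has 9 possible values (-8 to +8 in steps of 2) and is symmetric around 0; adding more loci or environmental noise would make it look increasingly continuous.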
Variance decomposition
$\sigma_p^2 = \sigma_g^2 + \sigma_e^2$
• $\sigma_p^2$: phenotypic variance due to all causes
• $\sigma_g^2$: phenotypic variance due to genetic variation
• $\sigma_e^2$: phenotypic variance due to environmental variation
Decomposition of phenotypic variance
attributable to genetic variation
$\sigma_g^2 = \sigma_a^2 + \sigma_d^2$
• $\sigma_g^2$: phenotypic variance due to genetic variation
• $\sigma_a^2$: phenotypic variance due to additive effects of genetic variation
• $\sigma_d^2$: phenotypic variance due to dominant effects of genetic variation
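phenotypic means of genotypes: AA: $-a$; AB: $d$; BB: $+a$ (on a scale with 0 midway between the two homozygotes)
$\sigma_a^2 = 2pq\,[a + d(q - p)]^2$
$\sigma_d^2 = (2pqd)^2$
(here p is taken to be the frequency of allele B and q = 1 - p the frequency of allele A)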
phenotypic means of genotypes: AA: $-a$; AB: $d = 0$; BB: $+a$
If the phenotypic mean of the heterozygote is halfway
between the two homozygotes, there is a “dose-response”
effect, i.e. each dose of allele B
increases the phenotype by the same amount. In
this case, d = 0, and there is no dominance
(interaction between alleles at the same
polymorphism).
$\sigma_d^2 = (2pqd)^2 = (2pq \cdot 0)^2 = 0$
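As a numerical check of the single-locus formulas, the sketch below computes the additive and dominance variance and compares their sum with the genetic variance obtained directly from the genotype frequencies. It assumes Hardy-Weinberg proportions and takes p as the frequency of allele B (the allele whose homozygote has mean +a); both are assumptions made for this illustration, and the function names are invented.

```python
def variance_components(p_b, a, d):
    """Additive and dominance variance for one biallelic locus.

    p_b: frequency of allele B (assumed convention); q = 1 - p_b is allele A.
    a, d: genotype means are AA = -a, AB = d, BB = +a.
    """
    q_a = 1.0 - p_b
    var_a = 2.0 * p_b * q_a * (a + d * (q_a - p_b)) ** 2
    var_d = (2.0 * p_b * q_a * d) ** 2
    return var_a, var_d


def total_genetic_variance(p_b, a, d):
    """Variance of genotype means under Hardy-Weinberg proportions (assumed)."""
    q_a = 1.0 - p_b
    genotypes = [(q_a ** 2, -a), (2.0 * p_b * q_a, d), (p_b ** 2, a)]
    mean = sum(freq * value for freq, value in genotypes)
    return sum(freq * (value - mean) ** 2 for freq, value in genotypes)


var_a, var_d = variance_components(p_b=0.3, a=1.0, d=0.5)
print(var_a, var_d, total_genetic_variance(0.3, 1.0, 0.5))

# With d = 0 (pure dose-response), the dominance variance vanishes.
print(variance_components(p_b=0.3, a=1.0, d=0.0))
```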
Decomposition of phenotypic variance
attributable to environmental variation
$\sigma_e^2 = \sigma_c^2 + \sigma_u^2$
• $\sigma_e^2$: phenotypic variance due to environmental variation
• $\sigma_c^2$: phenotypic variance due to environmental variation common among individuals (e.g., culture, household)
• $\sigma_u^2$: phenotypic variance due to environmental variation unique to an individual
Definition of heritability
The proportion of the phenotypic
variance in a trait that is attributable to
the effects of genetic variation.
$\hat{h}^2 = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_p^2}$
The absolute values of variance
attributable to a specific factor are not
important, as they depend on the
scale of the phenotype. It is the
relative values of variance that matter.
Broad-sense and
narrow-sense heritability
The proportion of the phenotypic variance in a trait
that is attributable to:
- effects of genetic variation (broad sense): $\hat{h}^2 = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_p^2}$
- additive effects of genetic variation (narrow sense): $\hat{h}^2 = \frac{\hat{\sigma}_a^2}{\hat{\sigma}_p^2}$
“Nature vs. nurture”
[Figure: a trait influenced by genes and environment, on a spectrum from 100% genetic / 0% environmental contribution to 0% genetic / 100% environmental contribution.]
Different degrees of relationship have
different phenotypic covariance/correlation
relative pair     phenotypic covariance                              phenotypic correlation
parent-child      $\frac{1}{2}\sigma_a^2$                            $\frac{1}{2}h^2$
full sibs         $\frac{1}{2}\sigma_a^2 + \frac{1}{4}\sigma_d^2$    $\frac{1}{2}h^2$
half sibs         $\frac{1}{4}\sigma_a^2$                            $\frac{1}{4}h^2$
first cousins     $\frac{1}{8}\sigma_a^2$                            $\frac{1}{8}h^2$
(assuming absence of effect of shared environment)
MZ and DZ twins have
different phenotypic covariance/correlation
relative pair       phenotypic covariance                                              phenotypic correlation
identical twins     $\sigma_a^2 + \sigma_d^2\ (+\ \sigma_c^2)$                         $h^2$
fraternal twins     $\frac{1}{2}\sigma_a^2 + \frac{1}{4}\sigma_d^2\ (+\ \sigma_c^2)$   $\frac{1}{2}h^2$
2x difference       $\sigma_a^2 + \frac{3}{2}\sigma_d^2$                               $h^2$
(assuming equal effect of shared environment)
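The "2x difference" row can be illustrated numerically. This sketch computes the expected twin covariances from assumed variance components and then applies the doubling rule to a pair of twin correlations; all function names and input values are hypothetical.

```python
def twin_covariances(var_a, var_d, var_c):
    """Expected phenotypic covariance for MZ and DZ twin pairs (table above)."""
    cov_mz = var_a + var_d + var_c
    cov_dz = 0.5 * var_a + 0.25 * var_d + var_c
    return cov_mz, cov_dz


def heritability_from_twin_correlations(r_mz, r_dz):
    """Twice the MZ-DZ correlation difference.

    Equals h^2 only if shared environment affects MZ and DZ pairs equally
    and dominance is negligible (assumptions stated on the slide).
    """
    return 2.0 * (r_mz - r_dz)


# Hypothetical variance components: the shared-environment term cancels.
cov_mz, cov_dz = twin_covariances(var_a=0.5, var_d=0.1, var_c=0.2)
twice_difference = 2.0 * (cov_mz - cov_dz)   # var_a + 1.5 * var_d
print(twice_difference)

# Hypothetical twin correlations.
print(heritability_from_twin_correlations(r_mz=0.8, r_dz=0.5))
```

Note that when dominance variance is present, doubling the difference overstates the additive component by 1.5 times the dominance variance, which is one reason twin-based estimates tend to be read as upper bounds.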
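Normal distribution
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}}$
[Figure: bell-shaped curve of f(x) plotted against x.]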
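Variance components approach:
multivariate normal distribution (MVN)
In variance components analysis, the phenotype is generally
assumed to follow a multivariate normal distribution:
$f(\mathbf{x}) = (2\pi)^{-n/2}\,|\Omega|^{-1/2} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\,\Omega^{-1}\,(\mathbf{x} - \boldsymbol{\mu})\right]$
$\ln f(\mathbf{x}) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Omega| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\,\Omega^{-1}\,(\mathbf{x} - \boldsymbol{\mu})$
• $n$: no. of individuals (in a pedigree)
• $\Omega$: n x n covariance matrix
• $\mathbf{x}$: phenotype vector
• $\boldsymbol{\mu}$: mean phenotype vector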
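To make the log-likelihood formula concrete, here is a small pure-Python sketch that evaluates it for one pedigree. It uses a Cholesky factorization to obtain both the determinant and the quadratic form; that is an implementation choice made here, not something taken from the slides, and real pedigree software handles this far more efficiently. All names and example values are hypothetical.

```python
import math


def cholesky(matrix):
    """Lower-triangular factor L with L L' = matrix (symmetric positive definite)."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def mvn_log_likelihood(x, mu, omega):
    """ln f(x) = -n/2 ln(2 pi) - 1/2 ln|Omega| - 1/2 (x - mu)' Omega^-1 (x - mu)."""
    n = len(x)
    low = cholesky(omega)
    # Solve L z = (x - mu) by forward substitution; then z'z is the quadratic form.
    z = []
    for i in range(n):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((x[i] - mu[i] - s) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in z)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * log_det - 0.5 * quad


# Hypothetical pair of relatives with phenotypic correlation 0.5.
omega = [[1.0, 0.5], [0.5, 1.0]]
print(mvn_log_likelihood(x=[1.0, 0.0], mu=[0.0, 0.0], omega=omega))
```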
Variance-covariance matrix
The variance-covariance matrix describes the
phenotypic covariance among pedigree members.
$\Omega = \sum_i \Omega_i\,\sigma_i^2$
• $\Omega_i$: n x n structuring matrix
• $\sigma_i^2$: scalar variance component (random effect)
“Sporadic” model:
no phenotypic resemblance
between relatives
In the simplest model, the phenotypic covariance
among pedigree members is only influenced by
environmental exposure unique to each individual.
Shared factors among relatives, such as genetic and
environmental factors, do not influence the trait.
$\Omega = I\,\sigma_u^2$
identity matrix I (rows and columns indexed by individuals 1, 2, ..., n):
        1    2    ...  n
  1     1    0    0    0
  2     0    1    0    0
  ...   0    0    1    0
  n     0    0    0    1
Identity matrix
[Pedigree: nuclear family with father (f), mother (m), and offspring 1, 2, 3.]
        f    m    1    2    3
  f     1    0    0    0    0
  m     0    1    0    0    0
  1     0    0    1    0    0
  2     0    0    0    1    0
  3     0    0    0    0    1
Modeling phenotypic
resemblance between relatives:
“polygenic” model
$\Omega = I\,\sigma_u^2 + 2\Phi\,\sigma_a^2$
• $\Phi$: kinship matrix
Kinship and relationship matrix
kinship matrix ($\Phi$):
Each element in the kinship matrix contains the probability that
the allele at a locus randomly drawn from the 2
chromosomal sets in a person is a copy of the same allele at
the same locus randomly drawn from the 2 chromosomal
sets in another person. For one individual, $\phi = 0.5$, assuming
absence of inbreeding.
relationship matrix ($2\Phi$):
This provides the probability that a given locus is shared
identical-by-descent among 2 individuals. This is equivalent
to the expected proportion of the genome that 2 individuals
share in common due to common ancestry. For one
individual, $2\phi = 1$, assuming absence of inbreeding.
Relationship matrix and Δ7 matrix
relationship               2Φ      Δ7
self                       1       1
MZ twin pair               1       1
DZ twin pair               0.5     0.25
full sibs                  0.5     0.25
half sibs                  0.25    0
grandparent-grandchild     0.25    0
avuncular                  0.25    0
first cousin               1/8     0
second cousin              1/32    0
Relationship matrix:
nuclear family
[Pedigree: father (f), mother (m), and offspring 1, 2, 3.]
        f     m     1     2     3
  f     1     0     0.5   0.5   0.5
  m     0     1     0.5   0.5   0.5
  1     0.5   0.5   1     0.5   0.5
  2     0.5   0.5   0.5   1     0.5
  3     0.5   0.5   0.5   0.5   1
Relationship matrix:
half-sibs
[Pedigree: parent m has offspring 1 with partner f1 and offspring 2 with partner f2.]
        f1    m     f2    1     2
  f1    1     0     0     0.5   0
  m     0     1     0     0.5   0.5
  f2    0     0     1     0     0.5
  1     0.5   0.5   0     1     0.25
  2     0     0.5   0.5   0.25  1
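The relationship matrices above can be generated from the pedigree structure itself. The sketch below uses the standard recursive definition of the kinship coefficient (a textbook algorithm, not one described in these slides) and doubles it to obtain 2Φ. The pedigree encoding, a dict mapping each individual to a pair of parents with parents listed before their children, is an assumption made for this example.

```python
def relationship_matrix(pedigree):
    """Return 2 * kinship for all pairs, in the order individuals are listed.

    pedigree: dict id -> (parent1, parent2); founders use (None, None).
    Assumes every parent appears in the dict before its children.
    """
    ids = list(pedigree)
    order = {pid: k for k, pid in enumerate(ids)}
    cache = {}

    def kinship(a, b):
        if a is None or b is None:
            return 0.0  # unknown parents of founders: treated as unrelated
        if order[a] > order[b]:
            a, b = b, a  # make b the individual listed later
        if (a, b) not in cache:
            p1, p2 = pedigree[b]
            if a == b:
                cache[(a, b)] = 0.5 * (1.0 + kinship(p1, p2))
            else:
                cache[(a, b)] = 0.5 * (kinship(a, p1) + kinship(a, p2))
        return cache[(a, b)]

    return [[2.0 * kinship(a, b) for b in ids] for a in ids]


nuclear = {"f": (None, None), "m": (None, None),
           "1": ("f", "m"), "2": ("f", "m"), "3": ("f", "m")}
half_sibs = {"f1": (None, None), "m": (None, None), "f2": (None, None),
             "1": ("f1", "m"), "2": ("f2", "m")}

for row in relationship_matrix(nuclear):
    print(row)
for row in relationship_matrix(half_sibs):
    print(row)
```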
Likelihood
• The likelihood of a hypothesis (e.g. specific parameter value(s))
on a given dataset, L(hypothesis|data), is defined to be
proportional to the probability of the data given the hypothesis,
P(data|hypothesis):
L(hypothesis|data) = constant * P(data|hypothesis)
• Because of the proportionality constant, a likelihood by itself has
no interpretation.
• The likelihood ratio (LR) of 2 hypotheses is meaningful if the 2
hypotheses are nested (i.e., one hypothesis is contained within
the other):
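$LR = \frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})} = \frac{c\,P(\text{data} \mid H_1)}{c\,P(\text{data} \mid H_0)} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$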
• Under certain conditions, maximum likelihood estimates are
asymptotically unbiased and asymptotically efficient. Likelihood
theory describes how to interpret a likelihood ratio.
Inference in heritability analysis
H0: (Additive) genetic variation does not
contribute to phenotypic variation
H1: (Additive) genetic variation does
contribute to phenotypic variation
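$-2 \ln \frac{L(H_0)}{L(H_1)} = -2 \ln \frac{L(\sigma_a^2 = 0,\ \tilde{\sigma}_u^2)}{L(\hat{\sigma}_a^2,\ \hat{\sigma}_u^2)}$
heritability: $\hat{h}^2 = \frac{\hat{\sigma}_a^2}{\hat{\sigma}_a^2 + \hat{\sigma}_u^2}$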
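Given the maximized log-likelihoods of the two models, the test statistic and the heritability estimate follow directly. The sketch below uses made-up log-likelihood and variance values. The p-value helper is an assumption added here: it uses the chi-square distribution with 1 degree of freedom, and halves it on the commonly used argument that a variance component tested at its boundary value of 0 follows a 50:50 mixture of a point mass at 0 and a 1-df chi-square. The slides do not state which reference distribution is intended.

```python
import math


def lrt_statistic(loglik_null, loglik_alt):
    """-2 ln [L(H0) / L(H1)], computed from maximized log-likelihoods."""
    return max(0.0, -2.0 * (loglik_null - loglik_alt))


def chi2_1df_pvalue(stat):
    """Upper tail of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


def boundary_pvalue(stat):
    """Assumed 50:50 mixture for a variance component tested at 0."""
    return 0.5 * chi2_1df_pvalue(stat) if stat > 0.0 else 1.0


def heritability(var_a, var_u):
    """Estimated additive heritability from the fitted variance components."""
    return var_a / (var_a + var_u)


# Hypothetical fitted values, for illustration only.
stat = lrt_statistic(loglik_null=-105.0, loglik_alt=-100.0)
print(stat, chi2_1df_pvalue(stat), boundary_pvalue(stat))
print(heritability(var_a=0.6, var_u=0.4))
```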
Modeling phenotypic
resemblance between relatives:
“polygenic” model allowing for
dominance
$\Omega = I\,\sigma_u^2 + 2\Phi\,\sigma_a^2 + \Delta_7\,\sigma_d^2$
• $\Delta_7$: matrix of probabilities that 2 individuals inherited the same alleles on both chromosomes from 2 common ancestors
Relationship matrix and Δ7 matrix
relationship               2Φ      Δ7
self                       1       1
MZ twin pair               1       1
DZ twin pair               0.5     0.25
full sibs                  0.5     0.25
half sibs                  0.25    0
grandparent-grandchild     0.25    0
avuncular                  0.25    0
first cousin               1/8     0
second cousin              1/32    0
Δ7 matrix:
nuclear family
[Pedigree: father (f), mother (m), and offspring 1, 2, 3.]
        f     m     1     2     3
  f     1     0     0     0     0
  m     0     1     0     0     0
  1     0     0     1     0.25  0.25
  2     0     0     0.25  1     0.25
  3     0     0     0.25  0.25  1
Inference in heritability analysis
H0: (Additive) genetic variation does not
contribute to phenotypic variation
H1: (Additive) genetic variation does
contribute to phenotypic variation
$-2 \ln \frac{L(H_0)}{L(H_1)} = -2 \ln \frac{L(\sigma_a^2 = 0,\ \sigma_d^2 = 0,\ \tilde{\sigma}_e^2)}{L(\hat{\sigma}_a^2,\ \hat{\sigma}_d^2,\ \hat{\sigma}_e^2)}$
2 degrees of freedom
Is it reasonable to assume that the only source
for phenotypic resemblance among relatives is
genetic?
No. To overcome this problem, one can try to
model shared environment, either in aggregate
or broken into specific environmental factors.
$\Omega = I\,\sigma_u^2 + H\,\sigma_c^2 + 2\Phi\,\sigma_a^2$
• $H$: household matrix; accounts for aggregate of
environmental factors shared among individuals
living in the same household
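Household matrix
[Pedigree: father (f), mother (m), and offspring 1, 2, 3; here f, m, and offspring 1 share a household, while offspring 2 and 3 each live in their own.]
        f    m    1    2    3
  f     1    1    1    0    0
  m     1    1    1    0    0
  1     1    1    1    0    0
  2     0    0    0    1    0
  3     0    0    0    0    1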
“Household” effect
$\hat{c}^2 = \frac{\hat{\sigma}_c^2}{\hat{\sigma}_p^2}$
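Nested models for heritability analysis
model                    $\sigma_u^2$   $\sigma_c^2$   $\sigma_a^2$
“sporadic”               +              -              -
“household”              +              +              -
“additive polygenic”     +              -              +
“general”                +              +              +
(The “household” and “additive polygenic” models are non-nested hypotheses: neither is contained within the other.)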
Inclusion of covariates
Measured covariates can easily be incorporated as
“fixed effects” in the multivariate normal model of the
phenotype, by making the expected phenotype
different for different individuals as a function of the
measured covariates.
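$\ln f(\mathbf{x}) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Omega| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\,\Omega^{-1}\,(\mathbf{x} - \boldsymbol{\mu})$
$\mu_i = \mu_{\text{overall}} + \sum_j \beta_j Y_{ij}$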
Inclusion of covariates
If covariates are not of interest in and of themselves,
one can “regress them out” before pedigree analysis.
$\hat{X}_i = \hat{\beta}_0 + \sum_j \hat{\beta}_j Y_{ij}$
$X_i = \hat{X}_i + \hat{e}_i$, i.e. $\hat{e}_i = X_i - \hat{X}_i$
Then use residuals as phenotype of interest in
pedigree analysis.
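"Regressing out" a covariate can be sketched as follows for the simplest case of a single covariate fitted by ordinary least squares. The function name and the toy data are hypothetical. Note that this preliminary regression treats individuals as independent, which is itself an approximation when the sample consists of relatives.

```python
def regress_out(phenotype, covariate):
    """Fit X_i = b0 + b1 * Y_i by least squares and return (b0, b1, residuals)."""
    n = len(phenotype)
    mean_x = sum(phenotype) / n
    mean_y = sum(covariate) / n
    s_xy = sum((y - mean_y) * (x - mean_x) for x, y in zip(phenotype, covariate))
    s_yy = sum((y - mean_y) ** 2 for y in covariate)
    slope = s_xy / s_yy
    intercept = mean_x - slope * mean_y
    # Residual e_i = X_i - X_hat_i is the covariate-adjusted phenotype.
    residuals = [x - (intercept + slope * y) for x, y in zip(phenotype, covariate)]
    return intercept, slope, residuals


# Hypothetical data: a phenotype that increases with a covariate such as age.
covariate = [1.0, 2.0, 3.0, 4.0, 5.0]
phenotype = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1, adjusted = regress_out(phenotype, covariate)
print(b0, b1)
print(adjusted)   # these residuals would be carried into the pedigree analysis
```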
Inference regarding covariates
in heritability analysis
H0: measured covariate Y does not influence phenotype.
H1: measured covariate Y does influence phenotype.
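$-2 \ln \frac{L(H_0)}{L(H_1)} = -2 \ln \frac{L(\hat{\sigma}_a^2,\ \hat{\sigma}_u^2,\ \beta = 0)}{L(\hat{\sigma}_a^2,\ \hat{\sigma}_u^2,\ \hat{\beta})}$
(with the variance components re-estimated under each hypothesis)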
Inference regarding covariates
in heritability analysis
H0: measured covariate Y does not influence phenotype.
H1: measured covariate Y does influence phenotype.
$-2 \ln \frac{L(H_0)}{L(H_1)} = -2 \ln \frac{L(\hat{\sigma}_u^2,\ \beta = 0)}{L(\hat{\sigma}_u^2,\ \hat{\beta})}$
CAUTION:
Related individuals in pedigrees are treated as unrelated.
This can easily lead to false positive findings regarding the
effect of the covariate!
Choice of covariates
Covariates ought to be included in the likelihood model if
they are known to influence the phenotype of interest and
if their own genetic regulation does not overlap the genetic
regulation of the target phenotype.
Typical examples include sex and age.
In the analysis of height, information on nutrition during
childhood should probably be included during analysis.
However, known growth hormone levels probably should
not be.
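Choice of covariates
[Figure: variance diagram comparing heritability without and with a covariate.]
$h^2_{\text{without cov}} = \frac{\sigma_a^2}{\sigma_p^2} = 0.5$
$h^2_{\text{with cov}} = \frac{\sigma_a^2 - \sigma_{a \cap \text{cov}}^2}{\sigma_p^2 - \sigma_{\text{cov}}^2} = 0.25$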
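Choice of covariates
[Figure: variance diagram comparing heritability without and with a covariate.]
$h^2_{\text{without cov}} = \frac{\sigma_a^2}{\sigma_p^2} = 0.2 < 0.3 = \frac{\sigma_a^2 - \sigma_{a \cap \text{cov}}^2}{\sigma_p^2 - \sigma_{\text{cov}}^2} = h^2_{\text{with cov}}$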
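Choice of covariates:
special case of treatment/medication
[Figure, first panel: probability density of the phenotype before treatment/medication of affected individuals, with “unaffected” and “affected” regions along the phenotype axis.]
[Figure, second panel: probability density of the phenotype after (partially effective) treatment/medication of affected individuals; the difference between the “unaffected” and the treated “affected” individuals is marked as the apparent effect of the covariate.]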
Choice of covariates:
special case of treatment/medication
• If medication is ineffective/partially effective, including treatment as a
covariate is worse than ignoring it in the analysis.
• If medication is very effective, such that the phenotypic mean of
individuals after treatment is equal to the phenotypic mean of the
population as a whole, then including medication as a covariate has
no effect.
• If medication is extremely effective, such that the phenotypic mean
of individuals after treatment is “better” than the phenotypic mean of
the population as a whole, then including medication as a covariate
is better than ignoring it, but still far from satisfying.
• Either censor individuals or, better, infer or integrate over their
phenotypes before treatment, based on information on efficacy etc.
Be careful in
interpretation of heritability estimates
While one can attempt to account for shared
environmental factors individually or in aggregate, it is
notoriously difficult to do so. In contrast to genetics where
“co-exposure” among relatives is predictable due to
inheritance rules, this is not the case with environmental
factors of interest in epidemiology. If environmental
co-exposure is not adequately modeled, shared
environmental effects tend to inflate the heritability
estimate, because shared exposure is generally greater
among relatives, thus mimicking the effects of genetic
similarity among relatives. Heritability estimates thus are
often overestimates.
Be careful in
interpretation of heritability estimates
Keep in mind that heritability estimates are applicable only
to a specific population at a specific point in time.
Heritability of adult height
(additive heritability, adjusted for sex and age)
study         sample size    heritability estimate
TOPS          2199           0.78
FLS           705            0.83
GAIT          324            0.88
SAFHS         903            0.76
SAFDS         737            0.92
SHFS (AZ)     643            0.80
SHFS (DK)     675            0.81
SHFS (OK)     647            0.79
Jiri          616            0.63
total         7449
Be careful in
interpretation of heritability estimates
Heritability is a population level parameter, summarizing
the strength of genetic influences on variation in a trait
among members of the population. It does not provide any
information regarding the phenotype in a given individual,
such as risk of disease.
Relative risk
The risk of disease (or another phenotype) in a relative of
an affected individual as compared to the risk of disease in
a randomly chosen person from the population.
$\lambda_{\text{relationship}} = \frac{p_{\text{relationship}}}{p}$
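Relative risk as a function of heritability
$\lambda_{\text{sib}} = 1 + \frac{1 - p}{p}\left(0.5\,h_a^2 + 0.25\,h_d^2\right)$
$p \to 0$ (rare disease): $\lambda_{\text{sib}} \to \infty$
$p \to 1$ (common disease): $\lambda_{\text{sib}} \to 1$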
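The dependence of the sibling relative risk on prevalence is easy to explore numerically. The sketch below simply evaluates the formula above; the function name and the input values are hypothetical and are not estimates for any real disease.

```python
def sib_relative_risk(p, h2_a, h2_d=0.0):
    """lambda_sib = 1 + (1 - p) / p * (0.5 * h2_a + 0.25 * h2_d).

    p: population prevalence (0 < p <= 1)
    h2_a, h2_d: additive and dominance heritability, as used in the formula above
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("prevalence p must lie in (0, 1]")
    return 1.0 + (1.0 - p) / p * (0.5 * h2_a + 0.25 * h2_d)


# Same hypothetical heritability, very different prevalences.
for prevalence in (0.001, 0.01, 0.1, 0.5):
    print(prevalence, sib_relative_risk(prevalence, h2_a=0.5))
```

For a fixed heritability, the relative risk grows without bound as the disease becomes rarer and shrinks toward 1 as it becomes common, so a large sibling relative risk for a rare disease does not by itself imply a higher heritability than a small one for a common disease.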
Prevalence and sibling relative risk for selected diseases
phenotype         p         $\lambda_{\text{sib}}$
autism            0.0004    75
IDDM              0.004     15
schizophrenia     0.01      9
NIDDM             0.2       3
obesity           0.4       <2
Be careful in
interpretation of heritability estimates
A heritability estimate is applicable only to a specific trait. If
you alter the trait in any way, such as inclusion of
additional/different covariates, this may alter the estimate
and/or alter the interpretation of the finding.
Example:
• left ventricular mass not adjusted for blood pressure
• left ventricular mass adjusted for blood pressure