Human Genetics - Home | Banff International Research Station

Human Genetics
Genetic Epidemiology
1
Family trees can have a lot of nuts
2
Genetic Epidemiology - Aims
1. Gene detection
2. Gene characterization
mode of inheritance
allele frequencies
→ prevalence, attributable risk
3
Genetic Epidemiology - Methods
• Aggregation
• Segregation
• Co-segregation
• Association
4
Segregation
affected and unaffected
or
two distributions:
determined by a dominant or recessive allele
Also possible: three distributions:
Can the dichotomy or trichotomy be explained by
Mendelian segregation?
5
Likelihood (parameter(s); data) ∝ Probability (data | parameter(s))
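The joint probability of the genotypes and phenotypes of all the members of a pedigree can be written as
\[ \prod_{\text{founders } i} P(G_i) \prod_{\text{nonfounders } j} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{\text{observed}} P(Y \mid G) \]
\[ L(\theta; Y) = \sum_{G_1} \sum_{G_2} \cdots \sum_{G_n} \prod_{\text{founders } i} P(G_i) \prod_{\text{nonfounders } j} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{\text{observed}} P(Y \mid G). \]
where \theta denotes the parameter(s)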
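A minimal sketch of this likelihood for the smallest possible pedigree (two founders and one child), summing the product of founder, transmission and penetrance terms over all genotype configurations. It assumes a single diallelic locus, Hardy-Weinberg proportions for founders, Mendelian transmission and a binary trait; the function names and the penetrance dictionary are hypothetical, chosen only for illustration.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

# Assumed Mendelian values of P(parent genotype transmits A)
TAU = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}

def founder_prob(g, q):
    """P(G_i) for a founder, assuming Hardy-Weinberg proportions; q = P(A)."""
    return {"AA": q * q, "Aa": 2 * q * (1 - q), "aa": (1 - q) ** 2}[g]

def offspring_prob(g, gf, gm):
    """P(G_j | G_f, G_m) built from the parental transmission probabilities."""
    tf, tm = TAU[gf], TAU[gm]
    return {"AA": tf * tm,
            "Aa": tf * (1 - tm) + (1 - tf) * tm,
            "aa": (1 - tf) * (1 - tm)}[g]

def penetrance(y, g, f):
    """P(Y | G) for a binary trait; f maps genotype to P(affected | genotype)."""
    return f[g] if y else 1 - f[g]

def trio_likelihood(y_father, y_mother, y_child, q, f):
    """Sum over all genotype configurations of the trio."""
    total = 0.0
    for gf, gm, gc in product(GENOTYPES, repeat=3):
        total += (founder_prob(gf, q) * founder_prob(gm, q)
                  * offspring_prob(gc, gf, gm)
                  * penetrance(y_father, gf, f)
                  * penetrance(y_mother, gm, f)
                  * penetrance(y_child, gc, f))
    return total

# Hypothetical dominant model: carriers of A are usually affected
f_dominant = {"AA": 0.9, "Aa": 0.9, "aa": 0.05}
print(trio_likelihood(1, 0, 1, q=0.1, f=f_dominant))
```

Brute-force enumeration grows as 3^n with pedigree size, so this is only practical for very small families; it is meant to show the structure of the sum, not an efficient algorithm.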
6
Transmission Probabilities
Transmission probability             Value if there is Mendelian segregation
P(AA transmits A) = τ_{AA A}         1
P(Aa transmits A) = τ_{Aa A}         ½
P(aa transmits A) = τ_{aa A}         0
7
Ascertainment
• We examine segregating sibships
• The proportion of sibs affected is larger
than expected on the basis of
Mendelian inheritance
• The likelihood must be conditional on the
mode of ascertainment
• We need to know the proband sampling
frame
8
Cosegregation
• Chromosome segments are transmitted
• Cosegregation is caused by linked loci
→ ultimate statistical proof of genetic etiology
9
Methods of Linkage Analysis
• Trait model-based – assume a genetic model
underlying the trait
(parametric)
• Trait model-free - no assumptions about the
genetic model underlying the trait
(non-parametric)
• Ascertainment is often not an issue for locus
detection by linkage analysis
10
Model-based Linkage Analysis
• If founder marker genotypes are known or can
be inferred exactly,
→ no increase in Type 1 error
→ smallest Type 2 error when the
model is correct
• If founder marker genotypes are unknown, we can
1) estimate them
2) use a database
• All parameters other than the recombination
fraction are assumed known
11
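\[ L(\theta; Y) = \sum_{G_1} \sum_{G_2} \cdots \sum_{G_n} \prod_{\text{founders } i} P(G_i) \prod_{\text{nonfounders } j} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{\text{observed}} P(Y \mid G). \]
P(G_j | G_{f_j}, G_{m_j}) is expressed as a function of 2-locus transmission probabilities, where \theta is the recombination fraction:
\[ \tau_{AB/ab\;AB} = \tau_{AB/ab\;ab} = \frac{1 - \theta}{2} \quad \text{and} \quad \tau_{AB/ab\;Ab} = \tau_{AB/ab\;aB} = \frac{\theta}{2} \]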
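As a small illustration of the 2-locus transmission probabilities, the sketch below returns the probability that a double heterozygote with phase AB/ab transmits each of the four possible haplotypes, given a recombination fraction theta. The function name is hypothetical; the values follow the standard rule that each non-recombinant haplotype has probability (1 - theta)/2 and each recombinant haplotype has probability theta/2.

```python
def haplotype_transmission(theta):
    """Haplotype transmission probabilities for a parent with phase AB/ab."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    nonrecombinant = (1.0 - theta) / 2.0
    recombinant = theta / 2.0
    return {"AB": nonrecombinant, "ab": nonrecombinant,
            "Ab": recombinant, "aB": recombinant}

# No linkage (theta = 0.5): all four haplotypes are equally likely
print(haplotype_transmission(0.5))
# Tight linkage (theta = 0.01): recombinant haplotypes are rare
print(haplotype_transmission(0.01))
```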
12
Model-free Linkage Analysis
Identity-in-state versus Identity-by-descent
Two alleles are identical by descent if they are copies
of the same parental allele
[Figure: pedigree with genotypes A1A1, A1A2, A1A2, A1A2, labeled IBD]
13
Sib pairs share
0, 1 or 2
alleles identical by
descent at a marker locus
0, 1 or 2
alleles identical by
descent at a trait locus
Linkage
The average proportion shared at any particular
locus is 1/2
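The sharing probabilities for a sib pair can be checked by enumeration. The sketch below assumes no linkage information at all: each sib independently receives one of the two paternal and one of the two maternal alleles, giving 16 equally likely outcomes. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def sib_pair_ibd_distribution():
    """P(sibs share 0, 1 or 2 alleles IBD) by enumerating parental transmissions."""
    counts = {0: 0, 1: 0, 2: 0}
    # For each sib: which of the two paternal and which of the two maternal alleles
    for pat1, mat1, pat2, mat2 in product((0, 1), repeat=4):
        shared = int(pat1 == pat2) + int(mat1 == mat2)
        counts[shared] += 1
    return {k: Fraction(v, 16) for k, v in counts.items()}

dist = sib_pair_ibd_distribution()
# Average proportion of alleles shared IBD (number shared divided by 2)
mean_proportion = sum(Fraction(k, 2) * p for k, p in dist.items())
print(dist, mean_proportion)
```

The enumeration gives probabilities 1/4, 1/2 and 1/4 for sharing 0, 1 and 2 alleles, so the average proportion shared is 1/2, as stated on the slide.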
14
Relative Pair Model-Free Linkage Analysis
• We correlate relative-pair similarity (dissimilarity)
for the trait of interest with relative-pair
similarity (dissimilarity) for a marker
• Linkage between a trait locus and a marker locus
→ positive correlation
• Affected relative pair analysis: Do affected relative
pairs share more marker alleles than expected if
there is no linkage?
• No controls!
15
Association
• Causes of association between a marker and a disease
  • chance
  • stratification, population heterogeneity
  • very close linkage
  • pleiotropy
16
Causes of Allelic Association
Heterogeneity/stratification
Simpson's paradox: If we mix two populations that have
both different disease prevalence and different marker
allele prevalence, and there is no association between the
disease and marker allele in each population, there will
be an association between the disease and the marker
allele in the mixed population.
This allelic association is nuisance association
The best solution to avoid this confounding is to study only
ethnically homogeneous populations
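A numerical sketch of this effect, using two hypothetical populations of 1000 people each. All prevalences and frequencies below are made-up illustrative values, and marker status is simplified to carrier versus non-carrier of the allele. Within each population disease and marker are independent (odds ratio 1), yet the pooled table shows an association.

```python
from fractions import Fraction

def expected_table(n, prevalence, carrier_freq):
    """Expected counts when disease and marker are independent within a population.

    Returns (affected carriers, affected non-carriers,
             unaffected carriers, unaffected non-carriers).
    """
    return (n * prevalence * carrier_freq,
            n * prevalence * (1 - carrier_freq),
            n * (1 - prevalence) * carrier_freq,
            n * (1 - prevalence) * (1 - carrier_freq))

def odds_ratio(table):
    a, b, c, d = table
    return (a * d) / (b * c)

# Hypothetical populations differing in both prevalence and allele frequency
pop_a = expected_table(1000, Fraction(1, 5), Fraction(4, 5))
pop_b = expected_table(1000, Fraction(1, 20), Fraction(1, 5))
mixed = tuple(x + y for x, y in zip(pop_a, pop_b))

print(odds_ratio(pop_a), odds_ratio(pop_b), float(odds_ratio(mixed)))
```

The pooled odds ratio exceeds 2 even though neither population shows any association on its own.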
17
(Tight) Linkage
Imagine a number of generations ago, a normal allele d
mutated to a disease allele D on a particular chromosome
on which the allele at a marker locus was A1
[Diagram: a mutation converts the chromosome carrying A1 and d into a chromosome carrying A1 and D]
This chromosome is passed down through the generations,
and now there are many copies. If the distance between D
and A1 is small, recombinations are unlikely, so most D
chromosomes carry A1
This is the type of allelic association we are interested in
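To see why distance matters, the sketch below computes the probability that a descendant D chromosome has had no recombination between D and the marker over a number of generations, which is (1 - theta)^t for recombination fraction theta per generation. The function name, the choice of 50 generations, and the approximation theta ≈ 0.01 per cM for short distances are assumptions made for illustration.

```python
def prob_unrecombined(theta, generations):
    """Probability of no recombination between two loci over the given generations."""
    return (1.0 - theta) ** generations

# Approximate theta by 0.01 per cM (reasonable only for short distances)
for cm in (0.1, 1.0, 6.0):
    theta = cm / 100.0
    print(cm, "cM:", round(prob_unrecombined(theta, 50), 3))
```

The closer the marker, the larger the share of D chromosomes that still carry the ancestral A1 allele, which is the source of the allelic association.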
Guarding Against Stratification
• Three solutions:
• use a homogeneous population
• use family-based controls
• use genomic control
19
Matching on Ethnicity
• Close relatives are the best controls, but
can lead to overmatching
• Cases and control family members must
have the same family history of disease
Siblings
Cousins
20
Transmission Disequilibrium Test
(TDT)
• A design that uses pseudosibs as controls
• Cases and their parents are typed for markers
Father A1A2 × Mother A2A2 → Child (case) A1A2
Transmitted genotype is A1A2
Untransmitted genotype is A2A2
Father transmits A1, does not transmit A2
Mother transmits A2, does not transmit A2
(uninformative in terms of alleles)
21
• Build up a 2 x 2 table:
                        Transmitted
                        A1      A2
Untransmitted    A1     a       b
                 A2     c       d
• The counts a and d come from homozygous parents
• The counts b and c come from heterozygous parents
• McNemar's test:
\[ \chi^2_1 = \frac{(b - c)^2}{b + c} \]
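A minimal sketch of the TDT computation from the two off-diagonal counts, b and c, contributed by heterozygous parents. The function names and the example counts are hypothetical. The p-value uses the identity P(χ² with 1 d.f. > x) = erfc(√(x/2)), so no external library is needed.

```python
import math

def tdt_statistic(b, c):
    """McNemar statistic (b - c)^2 / (b + c) from heterozygous-parent counts."""
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)

def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical counts from the discordant cells of the 2 x 2 table
stat = tdt_statistic(60, 40)
print(stat, chi2_1df_pvalue(stat))
```

Homozygous parents (counts a and d) do not enter the statistic, which is why they are called uninformative.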
22
Genomic Control
• Calculate an association statistic for a
candidate locus
• Calculate the same association statistic,
from the same sample, for a set of
unlinked loci
• Determine significance by reference to the
results for the unlinked loci
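One common way to carry out this reference step, sketched below, is to estimate an inflation factor as the median of the statistics at the unlinked loci divided by the median of a χ² distribution with 1 d.f. (about 0.455), and to divide the candidate statistic by it. This particular form is an assumption, since the slides only say that significance is determined by reference to the unlinked loci; the function names and example values are hypothetical.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549  # approximate median of a chi-square with 1 d.f.

def inflation_factor(null_stats):
    """Median-based inflation factor, not allowed to fall below 1."""
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)

def genomic_control_adjust(candidate_stat, null_stats):
    """Deflate the candidate-locus statistic by the estimated inflation factor."""
    return candidate_stat / inflation_factor(null_stats)

# Hypothetical statistics from unlinked loci in the same sample
null_stats = [0.3, 0.7, 0.9098, 1.8, 4.2]
print(inflation_factor(null_stats), genomic_control_adjust(10.0, null_stats))
```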
23
Linkage Between
a Marker and a Disease
• Intrafamilial association
• Typically no population association
• Not affected by population stratification
• Population association if very close
24
Association versus Linkage
Allelic Association
• Association at the population level
• Pinpoints alleles
• More powerful
• More tests required
• More sensitive to mistyping
• Sensitive to population stratification
Linkage
• Intrafamilial association
• Pinpoints loci
• Less powerful
• Fewer tests required
• Less sensitive to mistyping
• Not sensitive to population stratification
• Which is better?
25
What is the Best Design and Analysis?
• If heterogeneity / stratification is a non-issue,
unrelated cases and controls for association
analysis
(genome scan?)
• If heterogeneity / stratification could be an issue,
genome scan desired,
large extended pedigrees, type all (founders and nonfounders) for 200-400 equi-spaced markers, for linkage
analysis
Note: cost, burden of multiple testing
A wise investigator, like a wise investor, would hedge
bets with a judicious mix
26
Case-Control Data
• Consider a particular marker allele, A1, sample of cases and
controls:
Number of A1 alleles    Cases    Controls    Total
0                       r0       s0          n0
1                       r1       s1          n1
2                       r2       s2          n2
Total                   R        S           N
27
• Consider the probability structure:
Number of A1 Alleles    0     1     2
Cases                   p0    p1    p2
Controls                q0    q1    q2
• Cochran-Armitage trend: test the null hypothesis
p2 + ½p1 = q2 + ½q1
without assuming the two alleles a person has are
independent
Sasieni (1997) Biometrics 53:1253-1261
28
\[ Y_1 = \frac{\left[ (\hat{p}_2 + \tfrac{1}{2}\hat{p}_1) - (\hat{q}_2 + \tfrac{1}{2}\hat{q}_1) \right]^2}{\left( \frac{1}{R} + \frac{1}{S} \right) \frac{1}{N^2} \left[ N \left( \tfrac{1}{4} n_1 + n_2 \right) - \left( \tfrac{1}{2} n_1 + n_2 \right)^2 \right]} \]
asymptotically has a χ2 distribution with 1 d.f.
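A sketch of the trend statistic computed directly from the genotype counts r0, r1, r2 (cases) and s0, s1, s2 (controls). The numerator is the squared difference in mean allele score (0, ½, 1) between cases and controls, and the denominator is its null variance from the pooled sample. The function name and the example counts are hypothetical.

```python
def trend_test(cases, controls):
    """Cochran-Armitage trend statistic from (0, 1, 2)-allele genotype counts."""
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    R, S = sum(cases), sum(controls)
    N = R + S
    n1, n2 = r1 + s1, r2 + s2
    # Difference in estimated allele frequency: (p2 + p1/2) - (q2 + q1/2)
    diff = (r2 + 0.5 * r1) / R - (s2 + 0.5 * s1) / S
    # Null variance of the difference, using the pooled genotype counts
    var = (1 / R + 1 / S) * (N * (0.25 * n1 + n2) - (0.5 * n1 + n2) ** 2) / N ** 2
    return diff ** 2 / var

# Hypothetical genotype counts for 100 cases and 100 controls
print(trend_test((30, 50, 20), (50, 40, 10)))
```

Because the variance is estimated from genotype counts rather than allele counts, the two alleles within a person are not assumed independent.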
29
Cochran-Armitage Trend Test
• Does not assume independence of alleles within a
person
• Does assume independence of genotypes from
person to person
• Is not valid if there is population stratification
• The increased variance due to stratification can be
estimated from a random set of markers that are
independent of the disease
→ genomic control.
Devlin and Roeder (1999) Biometrics 55:997-1004
30
Case-only Studies
• Look at departure from
  A1A1: p²
  A1A*: 2p(1-p)
  A*A*: (1-p)²
where p = P(A1) = p2 + ½p1
• Suggested as
• more powerful (only cases needed)
• more precise (signal decreases faster with distance
from the causative locus)
• Hardy-Weinberg Disequilibrium (HWD) test statistic:
\[ \frac{\left[ \hat{p}_2 - (\hat{p}_2 + \tfrac{1}{2}\hat{p}_1)^2 \right]^2}{\text{estimated variance}} \to \chi^2_1 \]
31
Case - only Studies
• No power in the case of a multiplicative model
P(affected | A1A*) = √[P(affected | A1A1) P(affected | A*A*)]
• No controls
• there must be a difference in HWD between
cases and controls
• therefore we consider this HWD trend test:
\[ Y_2 = \frac{\left\{ \left[ \hat{p}_2 - (\hat{p}_2 + \tfrac{1}{2}\hat{p}_1)^2 \right] - \left[ \hat{q}_2 - (\hat{q}_2 + \tfrac{1}{2}\hat{q}_1)^2 \right] \right\}^2}{\text{estimated variance}} \]
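The quantity being compared between cases and controls is the Hardy-Weinberg disequilibrium coefficient, the frequency of A1A1 minus the squared A1 allele frequency. The sketch below computes it from genotype counts and forms the case-control difference that the HWD trend test squares. The variance estimate is not given on the slides and is deliberately left out; the function names and example counts are hypothetical.

```python
def hwd_coefficient(n0, n1, n2):
    """Estimate p2 - (p2 + p1/2)^2 from counts of genotypes with 0, 1, 2 copies of A1."""
    total = n0 + n1 + n2
    p1, p2 = n1 / total, n2 / total
    return p2 - (p2 + 0.5 * p1) ** 2

def hwd_trend_difference(cases, controls):
    """HWD coefficient in cases minus HWD coefficient in controls."""
    return hwd_coefficient(*cases) - hwd_coefficient(*controls)

# Hypothetical counts; the controls here are in exact Hardy-Weinberg proportions
cases = (30, 50, 20)
controls = (25, 50, 25)
print(hwd_coefficient(*cases), hwd_coefficient(*controls),
      hwd_trend_difference(cases, controls))
```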
32
\[ Y_1 = \frac{\hat{b}^2}{\mathrm{var}(\hat{b})} \]
\[ Y_2 = \frac{\hat{d}^2}{\mathrm{var}(\hat{d})} \]
33
Weighted average of the Cochran-Armitage trend test and the HWD trend
test statistics
\[ Y = \frac{\left[ w |\hat{b}| + (1 - w) |\hat{d}| \right]^2}{w^2 \, \mathrm{var}(\hat{b}) + (1 - w)^2 \, \mathrm{var}(\hat{d}) + 2w(1 - w) \, \mathrm{cov}(|\hat{b}|, |\hat{d}|)} \]
We want to give more weight to b or d, whichever yields
the larger signal
Therefore take
\[ w = \frac{Y_1}{Y_1 + Y_2} \]
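A sketch of how the weighted average could be assembled once the two component estimates and their variances are available. All inputs, including the covariance of |b| and |d|, are treated as given; how the covariance is estimated is not stated on the slides, so it is passed in as a plain argument. The function name and example numbers are hypothetical.

```python
def weighted_average_statistic(b, d, var_b, var_d, cov_abs):
    """Weighted average of the trend (b) and HWD trend (d) signals.

    cov_abs is the covariance of |b| and |d|, supplied by the caller.
    """
    y1 = b * b / var_b
    y2 = d * d / var_d
    if y1 + y2 == 0:
        return 0.0
    w = y1 / (y1 + y2)  # more weight to whichever signal is larger
    numerator = (w * abs(b) + (1 - w) * abs(d)) ** 2
    denominator = (w ** 2 * var_b + (1 - w) ** 2 * var_d
                   + 2 * w * (1 - w) * cov_abs)
    return numerator / denominator

# Hypothetical estimates with unit variances and no covariance
print(weighted_average_statistic(b=2.0, d=0.5, var_b=1.0, var_d=1.0, cov_abs=0.0))
```

When one of the two signals is zero, the weight falls entirely on the other and the statistic reduces to that component alone.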
34
• To investigate the null distribution of this average we simulate many different situations (sample sizes up to 10,000 cases and 10,000 controls) and generate \hat{p}_0, \hat{p}_1, \hat{p}_2 for cases and \hat{q}_0, \hat{q}_1, \hat{q}_2 for controls
• For all situations considered, the distribution is well approximated by a Gamma distribution
35
• As the sample size and marker allele
frequency increase, the largest mean and the
smallest variance occur for 10,000 cases and
10,000 controls, and for a marker allele
frequency 0.5
• For 10,000 cases and 10,000 controls, and
marker allele frequency 0.5, the upper tail of the
distribution is well approximated by a Gamma
distribution with mean μ = 1.78 and variance σ² = 3.45
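A Gamma distribution is usually specified by shape and scale rather than mean and variance. The conversion is shape = μ²/σ² and scale = σ²/μ, sketched below for the quoted values; the function name is hypothetical.

```python
def gamma_from_moments(mean, variance):
    """Shape and scale of the Gamma distribution with the given mean and variance."""
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale

shape, scale = gamma_from_moments(1.78, 3.45)
print(round(shape, 3), round(scale, 3))
```

The resulting shape is slightly below 1, so the fitted density is monotonically decreasing, loosely similar to a rescaled χ² with 1 to 2 degrees of freedom.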
36
• We develop a prediction equation to
determine percentiles of the null
distribution for smaller sample sizes and
marker allele frequencies
• We base goodness of fit on the root mean
squared error (RMSE) of logeα, calculated
for various sample size combinations, from
the variance among 50 replicate samples:
\[ \mathrm{RMSE} = \left[ \frac{1}{50} \sum (\log_e \hat{\alpha} - \log_e \alpha)^2 \right]^{1/2} \]
37
• With ~90% confidence, the true loge α lies in the
interval loge α̂ ± 1.645(RMSE), i.e., α̂ is within
e^{±1.645(RMSE)}-fold of the true α
• For total sample size (R + S) 200 or larger and α =
0.0001 or larger, in the very worst case (R = S =
100, α = 0.0001) with 90% confidence α could
differ from the true α by a factor of at most ~ 4.8
• The average RMSE is 0.35, corresponding to being
between 78% and 122% of the true α with 90%
confidence
38
POWER
Genetic Models Simulated
Probability of being affected given genotype
Model                A1A1     A1A*     A*A*
1 Recessive 1        1.00     0.10     0.10
2 Recessive 2        1.00     0.05     0.05
3 Additive           1.00     0.50     0.00
4 Multiplicative     0.81     0.045    0.0025
• Each simulated population contains 500,000 individuals
allowed to randomly mate for 50 generations after the
appearance of a disease mutation
• Marker loci placed at distances 0 – 6 cM from the disease
susceptibility locus
• For type I error, no association between the disease and
marker loci
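The penetrances in the table can be used to illustrate why a case-only HWD test has no power under a multiplicative model. The sketch below assumes the tested locus is the susceptibility locus itself, that the population is in Hardy-Weinberg proportions, and an arbitrary A1 frequency of 0.2; the function names are hypothetical. It computes the expected genotype frequencies among cases and their HWD coefficient.

```python
# Penetrances (A1A1, A1A*, A*A*) from the table of simulated models
MODELS = {
    "Recessive 1": (1.00, 0.10, 0.10),
    "Recessive 2": (1.00, 0.05, 0.05),
    "Additive": (1.00, 0.50, 0.00),
    "Multiplicative": (0.81, 0.045, 0.0025),
}

def case_genotype_freqs(p, penetrances):
    """Genotype frequencies among cases, assuming HWE in the population."""
    f2, f1, f0 = penetrances
    weights = (p * p * f2, 2 * p * (1 - p) * f1, (1 - p) ** 2 * f0)
    total = sum(weights)
    return tuple(w / total for w in weights)

def hwd_in_cases(p, penetrances):
    """HWD coefficient among cases: freq(A1A1) minus squared A1 allele frequency."""
    g2, g1, _ = case_genotype_freqs(p, penetrances)
    return g2 - (g2 + 0.5 * g1) ** 2

for name, pen in MODELS.items():
    print(name, round(hwd_in_cases(0.2, pen), 4))
```

Under the multiplicative model the heterozygote penetrance is the geometric mean of the two homozygote penetrances (0.045² = 0.81 × 0.0025), so cases remain in Hardy-Weinberg proportions and the coefficient is zero up to rounding.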
39
Tests Performed
Homogeneous populations
• HWD, cases only
• Allele test
• Allele test x HWD in cases
• HWD trend test
• Cochran-Armitage trend test
• Cochran-Armitage trend test x HWD trend test
• Weighted average
Population stratification
• Cochran-Armitage trend test with genomic control
• Product of this and the HWD trend test
• Weighted average with genomic control
40
Type I error, homogeneous population
∆ HWD test, cases only
▲ product of the allele test and HWD test
41
Type I error, population stratification
○ allele test
◊ Cochran-Armitage trend test
▲ product of the allele test and HWD test
■ weighted average test
● product of the Cochran-Armitage trend test and the HWD test
42
Power, homogeneous population
43
■
weighted average test
Power, population stratification
□ HWD trend test
♦ CA test with genomic control
■ weighted average with genomic control
44
Conclusions
• Under recessive inheritance, the weighted average
has better performance than either the Cochran-Armitage trend test or the HWD trend test
• Has good performance for other models as well
• The product of the Cochran-Armitage trend test
statistic and the HWD test statistic (cases only) has
better power, but has inflated Type I error if there is
population stratification
• The weighted average has good overall properties,
automatically controls for marker mistyping
45
With acknowledgment to
Kijoung Song
46
Can we use evolutionary models, when we
have large amounts of genetic data on a
sample of cases and controls, to obtain a
more powerful way of detecting loci
involved in the etiology of disease?
Will these models bear fruit or nuts?
47