Slide 1 - Institute of Statistical Science, Academia Sinica


Linkage Analysis I
-- Parametric
2006.3.3
I-Ping Tu
Book reference
• http://www.math.chalmers.se/Stat/Grundutb/Chalmers/TMS120/kompendium.pdf
• Genetic Linkage Web Resource:
http://linkage.rockefeller.edu/
1 Introduction
• Qualitative trait: e.g. tall/short, green/yellow, affected/unaffected
• Assume a genetic model
– Parametric linkage analysis
– Lod score method
– Large pedigrees
• No genetic model assumption
– Nonparametric linkage analysis
– Affected relative pairs
Parametric vs. Non-parametric
linkage analysis
• Parametric
– Assume genetic model known
• Non-parametric
– No assumptions about the genetic model
• The parametric model is more powerful when
the genetic model is correctly specified.
• Problem size limitations
– Parametric – large pedigrees, small number of
markers
– Non-parametric – small pedigrees, many markers
Phenotype
• Binary
– affected or unaffected
– Left handed or right handed
• Affected, unaffected, and unknown
– Unknown – possibly part of the syndrome
• Quantitative
– Insulin resistance
– Blood Pressure
Definitions
• Locus
– Position on a chromosome
– Marker locus
– Disease locus
• Marker
– A measurable unit on a chromosome
– Dinucleotide repeat (CA)n
– Single nucleotide polymorphism (SNP)
• Allele
– The measurement at a marker locus
– 2 alleles per locus (one per chromosome)
Marker alleles: 1 and 4
Alleles at the disease locus: A and a
The recombination fraction θ
θ = Probability of recombination between two loci.
θ = 0.5 if the distance is "large".
θ < 0.5 if the distance is "short".
An odd number of crossovers = recombination
An even number = no recombination
Haldane’s Mapping function
Recombination fraction – An example
No! Recombination fractions are not additive for large distances.
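Haldane's mapping function is usually written θ = (1 - e^(-2d))/2, with d the map distance in Morgans. The sketch below uses that form to illustrate why recombination fractions do not add up over adjacent intervals while map distances do. The function names and the two interval values (θ = 0.2 each) are hypothetical, chosen only for illustration.

```python
import math


def haldane_theta(d_morgans: float) -> float:
    """Recombination fraction for a map distance d (in Morgans), Haldane's function."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def haldane_distance(theta: float) -> float:
    """Inverse of Haldane's function: map distance (Morgans) for a given theta."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)


# Hypothetical loci A - B - C with theta = 0.2 on each interval.
theta_ab = 0.2
theta_bc = 0.2

# Map distances are additive, so add them and convert back to theta.
d_ac = haldane_distance(theta_ab) + haldane_distance(theta_bc)
theta_ac = haldane_theta(d_ac)

print(f"theta_AB + theta_BC = {theta_ab + theta_bc:.2f}")
print(f"theta_AC (Haldane)  = {theta_ac:.2f}")
```

Under this function the combined value equals θ_AB + θ_BC - 2·θ_AB·θ_BC, which is always smaller than the plain sum and never exceeds 0.5.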
Penetrance (Genetic Model)
• Probability of being affected
• Penetrance parameters: f = (f0, f1, f2)
Definition: fk = P(affected | k disease alleles), k = 0, 1, 2, i.e. the probability of being affected if you have k disease alleles.
Notation:
A = Disease allele
a = Normal allele
Disease genotypes: aa, Aa, or AA
Penetrance continued
                     Recessive               Dominant
                     Full p.    Reduced p.   Full p.    Reduced p.
f0 = P(aff | aa)     0          0            0          0
f1 = P(aff | Aa)     0          0            1          0.8
f2 = P(aff | AA)     1          0.7          1          0.8
Dominant with phenocopies and reduced penetrance: f0 = 0.01, f1 = 0.8, f2 = 0.8
Additive penetrances: f0 = 0, f1 = 0.4, f2 = 0.8
Age dependent penetrances
Population prevalence
Kp = Proportion of affected individuals in a population = P(aff)
Penetrances for the genotypes aa, Aa, AA:
P(aff | aa) = 0.03
P(aff | Aa) = P(aff ∩ Aa) / P(Aa) = 0.12 (definition of conditional probability)
P(aff | AA) = 0.50
Disease allele frequency p = 0.05
Assume that the population is in HWE:
P(aa) = (1-p)^2 = 0.95^2 = 0.9025
P(Aa) = 2p(1-p) = 0.095
P(AA) = p^2 = 0.0025
Kp = P(aff) = ?
Population prevalence contd.
Kp = Area of the red square / Total area (aa + Aa + AA)
= P(aff ∩ aa) + P(aff ∩ Aa) + P(aff ∩ AA)
= P(aff | aa)P(aa) + P(aff | Aa)P(Aa) + P(aff | AA)P(AA)   (the law of total probability)
= f0*(1-p)^2 + f1*2p(1-p) + f2*p^2
= 0.03*0.9025 + 0.12*0.095 + 0.50*0.0025 = 0.039725 ≈ 0.04
Estimation of the genetic model
• Segregation analysis
– It is possible to estimate
• mode of inheritance
• number of loci contributing to a segregating phenotype
• penetrance parameters
• relative frequency (p) of the disease allele in the population
– Problems?
• Large population based samples required
• Ascertainment bias
• In parametric linkage analysis we assume that
the genetic model is known.
2. Parametric two-point
linkage analysis
• Let θ be the recombination fraction between the disease gene and the observed marker.
– H0: θ = 0.5 vs. HA: θ < 0.5
Estimation of the recombination fraction θ
Example:
N = 4 trios with affected mother and daughter
Assume:
– that all the 12 individuals have been genotyped for a specific DNA marker
– that all the mothers are heterozygous at the marker locus
– that mothers and fathers have disease genotypes (Aa) and (aa), respectively
– that each daughter has inherited a disease allele from her mother
– that parental marker genotypes are not identical
– that the phase is known for all the mothers (unrealistic)
Data:
– Trios 1-3: No recombination between marker and disease locus
– Trio 4: Recombination between marker and disease locus
Estimate: θ* = 1/4
Estimation of θ continued
• Assume that all meioses can be scored
unequivocally as recombinant or non-recombinant
with regard to a marker locus and a disease locus
• n = Number of meioses
• r = Number of recombinant meioses
Estimate : θ* = r/n
Estimates above 0.5 are not relevant from a biological point of view.
Definition: θ* = min(0.5, r/n)
The binomial distribution
The number of recombinants r among n independent meioses
follows a binomial distribution.
The probability of r recombinants out of n is a function of the
recombination fraction θ. Let us denote this function L(θ).
Note that L(θ) is the probability (likelihood) of the observed data if the
recombination fraction is θ.
The maximum likelihood estimate (MLE) of θ is the value θ* for
which L(θ) reaches its maximum.
MLE: θ*= r/n
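As a worked illustration, the binomial likelihood is L(θ) = C(n, r) θ^r (1-θ)^(n-r). The sketch below evaluates it for the trio example above (r = 1 recombinant among n = 4 meioses) and also computes the two-point lod score in its usual form, log10 of L(θ)/L(0.5). The function names are hypothetical, and the code assumes every meiosis is phase-known and fully informative.

```python
import math


def binom_likelihood(theta: float, r: int, n: int) -> float:
    """L(theta): probability of r recombinants among n independent meioses."""
    return math.comb(n, r) * theta**r * (1.0 - theta) ** (n - r)


def theta_mle(r: int, n: int) -> float:
    """Maximum likelihood estimate of theta, truncated at 0.5."""
    return min(0.5, r / n)


def lod(theta: float, r: int, n: int) -> float:
    """Two-point lod score: log10 of L(theta) / L(0.5)."""
    numerator = binom_likelihood(theta, r, n)
    if numerator == 0.0:
        return float("-inf")  # data impossible under this theta
    return math.log10(numerator / binom_likelihood(0.5, r, n))


r, n = 1, 4                      # trio example: 1 recombinant in 4 meioses
theta_hat = theta_mle(r, n)
max_lod = lod(theta_hat, r, n)
print(f"theta* = {theta_hat}, lod(theta*) = {max_lod:.3f}")
```

With only four meioses the maximum lod score stays far below the values usually required to claim linkage, which is why many informative meioses (or many families) are needed.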
Lod score history
• Score proposed by Haldane & Smith 1947
• Newton E. Morton analysed the
distribution of the lod score statistic under
various assumptions
• Lod scores below -2 are generally
accepted as significant evidence against
linkage.
– Common in replication studies.
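Likelihood Ratio Test:
\[ H_0: x_1,\dots,x_n \sim f_0 \quad \text{vs.} \quad H_A: x_1,\dots,x_n \sim f_1 \]
\[ L_n = \frac{f_1(x_1,\dots,x_n)}{f_0(x_1,\dots,x_n)} \]
\[ L_n \ge B \;\Rightarrow\; \text{reject } H_0 \]
Sequential probability ratio test
\[ T = \inf\{ n : L_n \notin (A, B) \} \]
\[ L_T \le A \;\Rightarrow\; \text{accept } H_0, \qquad L_T \ge B \;\Rightarrow\; \text{reject } H_0 \]
\[ \alpha = P_0(L_T \ge B) = \text{Type I error}, \qquad \beta = P_1(L_T \le A) = \text{Type II error} \quad (1 - \beta = \text{power}) \]
There is a neat approximation relating α, β, A, B:
\[ \alpha = E_0\, 1(L_T \ge B) = \sum_{n=0}^{\infty} \int f_0(x_1,\dots,x_n)\, 1(T = n, L_T \ge B)\, dx_1 \dots dx_n \]
\[ = \sum_{n=0}^{\infty} \int f_1(x_1,\dots,x_n)\, \frac{f_0(x_1,\dots,x_n)}{f_1(x_1,\dots,x_n)}\, 1(T = n, L_T \ge B)\, dx_1 \dots dx_n \]
\[ = \sum_{n=0}^{\infty} E_1\!\left[ \frac{1}{L_n}\, 1(T = n, L_n \ge B) \right] \le \frac{1}{B}\, E_1\, 1(L_T \ge B) = \frac{1}{B}(1 - \beta) \]
\[ \beta = E_1\, 1(L_T \le A) = E_0\!\left[ L_T\, 1(L_T \le A) \right] \le A(1 - \alpha) \]
Approximate the inequalities by equalities:
\[ B\alpha + \beta = 1, \qquad A\alpha + \beta = A \]
\[ \alpha = \frac{1 - A}{B - A}, \qquad \beta = \frac{A(B - 1)}{B - A} \]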
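The approximation can be evaluated numerically. The sketch below converts between the boundaries (A, B) and the approximate error rates (α, β). The function names are hypothetical, and the example boundaries A = 10^-2 and B = 10^3 are an assumption: they correspond to the lod threshold of -2 mentioned earlier together with the conventional threshold of +3, which is not stated in these slides.

```python
def sprt_error_rates(A: float, B: float) -> tuple[float, float]:
    """Approximate (alpha, beta) for an SPRT with boundaries 0 < A < 1 < B."""
    if not 0.0 < A < 1.0 < B:
        raise ValueError("need 0 < A < 1 < B")
    alpha = (1.0 - A) / (B - A)
    beta = A * (B - 1.0) / (B - A)
    return alpha, beta


def sprt_boundaries(alpha: float, beta: float) -> tuple[float, float]:
    """Inverse relation: A = beta / (1 - alpha), B = (1 - beta) / alpha."""
    return beta / (1.0 - alpha), (1.0 - beta) / alpha


# Assumed boundaries on the likelihood-ratio scale (lod -2 and lod +3).
A, B = 1e-2, 1e3
alpha, beta = sprt_error_rates(A, B)
print(f"alpha ~ {alpha:.5f}, beta ~ {beta:.5f}")
```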
More complicated situations
• Phase unknown
• Marker or disease gene homozygosity
• Reduced penetrance
• Varying penetrance
– age, sex, phenotype, diagnostic uncertainty
• Phenocopies
• Missing marker data
• Extended pedigrees
• Pedigree loops
• Multilocus genotypes
Recessive mode of inheritance
Prerequisites
• Autosomal recessive inheritance
• 100% penetrance: f0 = f1 = 0, f2 = 1
• No phenocopies
• Nuclear family typed for one informative marker
• All four meioses are informative
More complicated situations
• Reduced penetrance
• Varying penetrance
– age, sex, phenotype, diagnostic uncertainty
• Phenocopies
• Missing marker data
• Extended pedigrees
• Pedigree loops
• Multilocus genotypes
Lod score assignment
The pedigree likelihood contd.
g = (G1, G2, G3, G4) in the recessive example.
P(y|g) depends on the penetrance parameters f = (f0, f1, f2)
P(g|θ) depends on disease and marker allele frequencies
Ex: G1 in the recessive example: (1A|2a , 3A|4a)
P(g|θ) is the product of
– 2pq * 2p1p2 for the father
– 2pq * 2p3p4 for the mother
– θ^2/4 for affected daughter 3
– θ^2/4 for affected daughter 4
P(g|θ)
• P(y|g): genetic model
• P(g|θ) = \prod_i P(g_i) \prod_j P(g_j | g_{F_j}, g_{M_j})
– i indexes founders
– j indexes non-founders; F_j and M_j are the father and mother of j
– Genotypes g include those of marker and disease genes
– Missing data, multilocus markers…
More on missing marker data
• Good estimates of the allele frequencies are necessary
• Assuming a uniform allele frequency distribution is usually not a good idea
– Bias
– See e.g. Ott (1999)
• Allele frequencies for markers are available on websites.
• Genotype, say, 50 unrelated controls from the same population
– It is also possible to use alleles from individuals in the study without introducing bias.
Heterogeneity
• Allelic heterogeneity
– Ex: Different mutations in BRCA1 will lead to
the same phenotype
• Genetic heterogeneity
– Only a proportion of the families in a study
can be explained by one disease locus.
– Test for heterogeneity
• Smith (1963) - The admixture test
• Implemented in HOMOG (a program in the LINKAGE package)
• Estimates the proportion of linked families
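The idea behind the admixture test can be sketched as follows: if a proportion α of families is linked, the likelihood of family i becomes α·L_i(θ) + (1-α)·L_i(0.5), so the heterogeneity lod score is the sum over families of log10(α·10^(Z_i(θ)) + 1 - α), where Z_i(θ) is the family's ordinary lod score. The code below is an illustration of that formula only, not the HOMOG implementation; the function names, the per-family lod scores at a fixed θ, and the simple grid search are all hypothetical.

```python
import math


def hlod(alpha: float, family_lods: list[float]) -> float:
    """Heterogeneity lod score for a proportion alpha of linked families."""
    return sum(math.log10(alpha * 10.0**z + (1.0 - alpha)) for z in family_lods)


def best_alpha(family_lods: list[float], steps: int = 1000) -> float:
    """Grid-search estimate of the proportion of linked families."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda a: hlod(a, family_lods))


# Hypothetical per-family lod scores, all evaluated at the same theta.
family_lods = [1.2, 0.9, -0.8, 1.5, -1.1, -0.6]

alpha_hat = best_alpha(family_lods)
print(f"estimated proportion linked = {alpha_hat:.3f}")
print(f"HLOD at alpha_hat           = {hlod(alpha_hat, family_lods):.3f}")
print(f"homogeneity lod (alpha = 1) = {hlod(1.0, family_lods):.3f}")
```

In practice θ is maximized jointly with α; fixing θ here keeps the sketch short.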
Age-dependent penetrance contd.
Assume that a 45-year-old woman comes to the clinic. What are the odds that she is a disease gene carrier?
Penetrance if
aa: 0.0012
Aa: 0.0235
Odds: 0.0235 : 150*0.0012, i.e. about 1:8
Odds of being a disease gene carrier in different age bands:
Age band   Odds
<30        1:2
30-39      1:3
40-49      1:8
50-59      1:12
60-69      1:27
70-79      1:36
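The 40-49 entry can be checked with a short calculation. The sketch below reads the factor 150 as the number of non-carriers per carrier in the population; that interpretation, like the function name, is an assumption and is not spelled out on the slide.

```python
def carrier_odds_against(f_carrier: float, f_noncarrier: float,
                         noncarriers_per_carrier: float) -> float:
    """Return k such that the odds of being a carrier are 1:k."""
    return noncarriers_per_carrier * f_noncarrier / f_carrier


# Penetrances from the slide; 150 is the assumed non-carrier : carrier ratio.
k = carrier_odds_against(f_carrier=0.0235, f_noncarrier=0.0012,
                         noncarriers_per_carrier=150)
print(f"odds of being a carrier are about 1:{k:.1f}")
```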
General pedigrees
• The Elston-Stewart algorithm (1971)
– Start at the bottom of the pedigree and solve
the problem for each nuclear family.
– The likelihood for each branch is 'peeled' onto the individual linking the sub-tree to the rest of the pedigree
Two-point vs. Multipoint Linkage
• Two-point linkage analysis
– Analyze marker-disease co-segregation one locus at
a time
• One two-point lod score for each marker
• IBS-sharing of a marker allele might lead to false positive lod scores; if possible, look at haplotypes.
• Multipoint (often sliding n-point)
– Regard the marker positions as fixed
– Vary the location (x) of the disease locus across each
sub-map of n adjacent markers.
– Compare each multilocus likelihood to a likelihood corresponding to 'x off the map' (θ = 0.5).
Software
• Jurg Ott's website at Rockefeller University
– http://linkage.rockefeller.edu/soft
• For parametric linkage analysis
– LINKAGE
– FASTLINK
– VITESSE
Linkage Analysis II
--Nonparametric
IBS or IBD
Marker genotypes of the affected sibs: 1 4 and 4 2
The affected sibs have one allele in
common (4), but the 4-alleles come
from different parents.
Definition:
Two alleles are said to be identical by state
(IBS) if they are of the same kind.
If two alleles have the same ancestral origin
they are said to be identical by descent
(IBD)
IBS-count: 1
IBS is a weaker concept than IBD
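Counting IBS sharing only needs the two observed genotypes, whereas IBD also needs the parental origin of each allele. A minimal sketch of the IBS count, treating each genotype as an unordered pair of allele labels (the function name is hypothetical):

```python
from collections import Counter


def ibs_count(genotype1, genotype2) -> int:
    """Number of alleles (0, 1 or 2) that two genotypes share identical by state."""
    shared = Counter(genotype1) & Counter(genotype2)  # multiset intersection
    return sum(shared.values())


# Sib genotypes from the example: they share the 4-allele by state.
print(ibs_count((1, 4), (4, 2)))
```

Because the two 4-alleles come from different parents in the example, the sibs share one allele IBS but none IBD.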
Notation
x = A fixed locus on the genome
N = N(x) = The number of alleles shared IBD by an affected sib pair at locus x
Let us first assume that x is the disease locus
ASP linkage analysis
• Collect affected sib pairs
– How many depends on the genetic effect
– Power calculations
• Genotype all 4 members of each pedigree
• Estimate the conditional IBD probabilities Ψ = (z0, z1, z2)
• Compare with the IBD probabilities under the null hypothesis of no linkage: z_H0 = (0.25, 0.5, 0.25) (Binomial)
P(N = k) k=0, 1, 2 ?
Possible parental disease locus genotypes (first parent aa, Aa, AA by second parent aa, Aa, AA):
(aa, aa)  (aa, Aa)  (aa, AA)
(Aa, aa)  (Aa, Aa)  (Aa, AA)
(AA, aa)  (AA, Aa)  (AA, AA)
The corresponding genotype probabilities under the assumption of HWE and independence between the parents are (with q = 1 - p):
\[ \begin{pmatrix} q^2 \\ 2pq \\ p^2 \end{pmatrix} \begin{pmatrix} q^2 & 2pq & p^2 \end{pmatrix} = \begin{pmatrix} q^4 & 2pq^3 & p^2q^2 \\ 2pq^3 & 4p^2q^2 & 2p^3q \\ p^2q^2 & 2p^3q & p^4 \end{pmatrix} \]
This matrix is symmetric, so it is sufficient to consider 6 different mating types.
P(N = k) k=0, 1, 2
Mating type   Parental genotypes   P(Ci)
C1            aa, aa               q^4
C2            Aa, aa               4pq^3
C3            Aa, Aa               4p^2q^2
C4            AA, aa               2p^2q^2
C5            AA, Aa               4p^3q
C6            AA, AA               p^4
\[ P(N = 0 \mid \text{2 aff sibs}) = \frac{P(\text{2 aff sibs} \mid \text{IBD} = 0)\, P(\text{IBD} = 0)}{P(\text{2 aff sibs})} \]
P(IBD = 0) = 0.25
Before we go on, remember the genetic model: Recessive disease with f = (0, 0, 1)
\[ P(\text{2 aff sibs} \mid \text{IBD} = 0) = \sum_{i=1}^{6} P((\text{2 aff sibs} \mid \text{IBD} = 0) \mid C_i)\, P(C_i) = 1 \cdot p^4 = p^4 \]
Why?
Because both affected sibs must have 2 disease alleles and these pairs of alleles must be of different parental origin. Thus P((2 aff sibs | IBD = 0) | Ci) = 0 for i = 1-5.
Finally we calculate the denominator P(2 aff sibs).
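One way to check the numerator and obtain the denominator is brute-force enumeration of parental alleles and transmissions. The sketch below does this for an arbitrary penetrance vector, assuming HWE, independent parents, and one disease locus; the function name and the choice p = 0.05 (borrowed from the prevalence example) are assumptions made for illustration.

```python
from itertools import product


def asp_ibd_probs(p: float, penetrances) -> tuple[float, float, float]:
    """Return (z0, z1, z2) = P(N = k | both sibs affected) at the disease locus."""
    allele_prob = {1: p, 0: 1.0 - p}   # 1 = disease allele A, 0 = normal allele a
    joint = [0.0, 0.0, 0.0]            # joint[k] = P(N = k and both sibs affected)
    # Enumerate the four parental alleles (HWE, independent parents).
    for f1, f2, m1, m2 in product((0, 1), repeat=4):
        p_parents = (allele_prob[f1] * allele_prob[f2]
                     * allele_prob[m1] * allele_prob[m2])
        father, mother = (f1, f2), (m1, m2)
        # Enumerate which allele each sib receives from each parent (16 cases).
        for i, j, k, l in product((0, 1), repeat=4):
            n_disease_sib1 = father[i] + mother[j]
            n_disease_sib2 = father[k] + mother[l]
            ibd = int(i == k) + int(j == l)
            joint[ibd] += (p_parents / 16.0
                           * penetrances[n_disease_sib1]
                           * penetrances[n_disease_sib2])
    p_two_affected = sum(joint)        # the denominator P(2 aff sibs)
    return tuple(x / p_two_affected for x in joint)


p = 0.05
z0, z1, z2 = asp_ibd_probs(p, penetrances=(0.0, 0.0, 1.0))  # recessive, f = (0, 0, 1)
print(f"z0 = {z0:.4f}, z1 = {z1:.4f}, z2 = {z2:.4f}")
print(f"lambda_s = 0.25 / z0 = {0.25 / z0:.2f}")
```

For the recessive model the enumeration agrees with the closed form (z0, z1, z2) = (p^2, 2p, 1) / (1 + p)^2, which follows from P(2 aff sibs | IBD = k) = p^(4-k).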
IBD probabilities for a few genetic models
Table 2.1 page 30 in the compendium
λs= Sibling relative risk = 0.25/z0
(strength of the genetic component)
The Maximum Lod Score (MLS)
Assumptions:
– n affected sib pairs
– a marker at a specific test locus x has been genotyped
– perfect marker information (N = N(x) known)
Null hypothesis H0: Ψ = (0.25, 0.5, 0.25)
Alternative H1: Ψ = (z0, z1, z2) ≠ (0.25, 0.5, 0.25) (a fixed alternative)
Pedigree number i: Ni = 2
The support for the alternative hypothesis is
\[ LR_i(x; \Psi) = \frac{P(N_i = 2 \mid H_1)}{P(N_i = 2 \mid H_0)} = \frac{z_2}{0.25} = 4 z_2 \]
Ex: LR = 4 at the disease locus if z2 = 1 (recessive disease with full penetrance and no phenocopies)
MLS continued
\[ LR_i(x; \Psi) = \frac{P(N_i = j \mid H_1)}{P(N_i = j \mid H_0)} = \begin{cases} z_0 / 0.25 = 4 z_0 & \text{if } j = 0 \\ z_1 / 0.5 = 2 z_1 & \text{if } j = 1 \\ z_2 / 0.25 = 4 z_2 & \text{if } j = 2 \end{cases} \]
Note: Both the observed IBD-count (j) and the IBD-probabilities Ψ depend on x.
n affected sib pairs
# 0 IBD = n0 = n0(x)
# 1 IBD = n1 = n1(x)
# 2 IBD = n2 = n2(x)
Combined evidence in favor of H1:
\[ LR(x; \Psi) = LR_1(x; \Psi) \cdot LR_2(x; \Psi) \cdots LR_n(x; \Psi) = (4 z_0)^{n_0} (2 z_1)^{n_1} (4 z_2)^{n_2} \]
The LOD score (base 10):
\[ Z(x; \Psi) = \log_{10}\!\left( (4 z_0)^{n_0} (2 z_1)^{n_1} (4 z_2)^{n_2} \right) = n_0 \log_{10}(4 z_0) + n_1 \log_{10}(2 z_1) + n_2 \log_{10}(4 z_2) \]
MLS continued
The maximum lod score
\[ \max_{\Psi} Z(x; \Psi) \]
is known as the MLS-score.
The Ψ corresponding to the MLS-score is the maximum likelihood estimate of Ψ, given by the relative frequencies:
\[ \hat{\Psi} = (n_0/n,\; n_1/n,\; n_2/n) \]
Constrained maximization over Holman's triangle leads to increased power.
The derivation is more complicated under incomplete marker information.
The MMLS-score is defined as the maximum of the MLS-scores over x.
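The lod score and the MLS at a single locus can be computed directly from the IBD counts. The sketch below assumes perfect marker information and performs the unconstrained maximization only (no restriction to Holman's triangle); the function names and the example counts are hypothetical.

```python
import math

NULL_PSI = (0.25, 0.5, 0.25)   # IBD probabilities under H0 (no linkage)


def asp_lod(counts, psi) -> float:
    """Z(x; psi) = n0*log10(4*z0) + n1*log10(2*z1) + n2*log10(4*z2)."""
    total = 0.0
    for n_k, z_k, null_k in zip(counts, psi, NULL_PSI):
        if n_k == 0:
            continue                   # convention: 0 * log(anything) = 0
        if z_k <= 0.0:
            return float("-inf")       # observed class impossible under psi
        total += n_k * math.log10(z_k / null_k)
    return total


def mls(counts):
    """Unconstrained maximum lod score and the MLE of psi (relative frequencies)."""
    n = sum(counts)
    psi_hat = tuple(c / n for c in counts)
    return asp_lod(counts, psi_hat), psi_hat


# Hypothetical IBD counts (n0, n1, n2) for n = 100 affected sib pairs.
counts = (10, 45, 45)
score, psi_hat = mls(counts)
print(f"psi_hat = {psi_hat}, MLS = {score:.2f}")
```

Repeating the calculation at every test locus x and taking the largest value gives the MMLS-score.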
NPL Score
• Example: Half Sib Pair
X_{ij,t}: indicator that the i-th pair shares j alleles IBD at locus t
X_{1,t} = Σ_i X_{i1,t}, λ = recombination rate, τ = trait locus
P(X_{i1,t} = 1 | affected half sibs) = (1 + α e^{-2λ|t-τ|})/2
Log-likelihood = X_t log(1 + α) + (N - X_t) log(1 - α)
Score statistic for testing H0: α = 0 is X_{1,t}
For τ unknown, we use max_t Y_t, where Y_t = X_{1,t}
Remark: Y_t is a Markov chain
The NPL Score
NPL = Non Parametric Linkage
Before we define the score let us repeat the definitions of expectation and variance:
Expectation:
\[ \mu_N = E(N) = \text{Expected value of } N = \sum_{k=0}^{2} k \cdot P(N = k) \]
Ex: E(N) = 0*z0 + 1*z1 + 2*z2 = z1 + 2z2
Variance:
\[ V(N) = E((N - \mu_N)^2) = E(N^2 + \mu_N^2 - 2 N \mu_N) = E(N^2) - E(N)^2 \]
Ex: V(N) = 0*z0 + 1*z1 + 4*z2 - (z1 + 2z2)^2
Under the null hypothesis of no linkage
Under H0: Ψ = (z0, z1, z2) = (0.25, 0.5, 0.25)
E(N) = z1 + 2z2 = 0.5 + 2*0.25 = 1
V(N) = E(N^2) - E(N)^2 = z1 + 4z2 - 1^2 = 0.5 + 4*0.25 - 1 = 0.5
The NPL score continued
Definition: SD(N) = Standard deviation of N = \sqrt{V(N)}
Under H0: σ_N = SD(N) = \sqrt{0.5}
Standardization:
\[ Z = \frac{N - \mu_N}{\sigma_N} \]
has expectation 0 and standard deviation 1.
For the i:th sib pair define the NPL family score:
\[ Z_i = \begin{cases} -\sqrt{2} & \text{with probability } z_0 \\ 0 & \text{with probability } z_1 \\ \sqrt{2} & \text{with probability } z_2 \end{cases} \]
Note:
\[ Z_i = \frac{N_i - 1}{\sqrt{0.5}} = \sqrt{2}\,(N_i - 1) \]
E(Zi) = 0 under H0
E(Zi) > 0 under H1
The NPL score at a locus x
\[ Z(x) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i(x) = \sqrt{\frac{2}{n}}\,\bigl(n_2(x) - n_0(x)\bigr) \]
Properties:
E( Z(x) ) = 0 under H0
V( Z(x) ) = 1 under H0
Large NPL scores lead to rejection of H0
E( Z(x) ) > 0 under H1
E( Z(x) ) increases with the sample size under H1
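The NPL score at one locus follows directly from the observed IBD counts. A minimal sketch, assuming perfect marker information so that each N_i is known; the function names and the example data are hypothetical.

```python
import math


def npl_family_score(n_ibd: int) -> float:
    """Z_i = sqrt(2) * (N_i - 1): standardized IBD count under H0."""
    return math.sqrt(2.0) * (n_ibd - 1)


def npl_score(ibd_counts) -> float:
    """Z(x) = (1 / sqrt(n)) * sum of the family scores Z_i(x)."""
    n = len(ibd_counts)
    return sum(npl_family_score(k) for k in ibd_counts) / math.sqrt(n)


# Hypothetical data: 100 affected sib pairs with n0 = 15, n1 = 45, n2 = 40.
ibd_counts = [0] * 15 + [1] * 45 + [2] * 40
z = npl_score(ibd_counts)
print(f"NPL score Z(x) = {z:.3f}")

# Moments of Z_i under H0, Psi = (0.25, 0.5, 0.25).
null_psi = (0.25, 0.5, 0.25)
mean_h0 = sum(z_k * npl_family_score(k) for k, z_k in enumerate(null_psi))
var_h0 = sum(z_k * npl_family_score(k) ** 2 for k, z_k in enumerate(null_psi)) - mean_h0**2
```

The last lines recompute the H0 mean and variance of the family score from the null IBD probabilities, matching the properties listed above.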