Increasing Power in Association Studies by using
Linkage Disequilibrium Structure and Molecular
Function as Prior Information
Eleazar Eskin
UCLA
Motivation
• Whole genome association study
• How to perform multiple hypothesis
correction
– To increase statistical power
• Incorporate prior information on molecular function
of associated loci
• Information on linkage disequilibrium structure
Main idea
• Traditional method
– Use a single significance threshold
• In practice, markers are not identical
• Set a different threshold at each marker,
which reflects both intrinsic (e.g. LD, allele
freq.) and extrinsic information on the
markers
Standard Association Study
• M markers in N cases and N controls
• $f_i$ = minor allele frequency at marker i
• True case/control allele frequencies: $p_i^+$ / $p_i^-$
• Marker d: causal variant with a relative risk $\gamma$
$p_d^+ = \frac{\gamma f_d}{\gamma f_d + (1 - f_d)}$, $p_d^- \approx f_d$
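As a small illustration of the relative-risk relation on this slide, the sketch below computes the case allele frequency from the population frequency. It assumes the formula reads $p_d^+ = \gamma f_d / (\gamma f_d + (1 - f_d))$; the function name `case_allele_frequency` is hypothetical and not from the talk.

```python
def case_allele_frequency(f_d: float, gamma: float) -> float:
    """Allele frequency among cases for a causal variant.

    f_d   : allele frequency in the population (assumed to approximate
            the control frequency)
    gamma : relative risk of the variant (assumed multiplicative model)
    """
    if not 0.0 <= f_d <= 1.0:
        raise ValueError("f_d must lie in [0, 1]")
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    return gamma * f_d / (gamma * f_d + (1.0 - f_d))


# Example: a variant at 20% frequency with relative risk 2.
p_case = case_allele_frequency(0.2, 2.0)
p_control = 0.2  # approximately f_d
print(p_case, p_control)
```

With a relative risk of 1 the function returns f_d unchanged, which is a quick sanity check that cases and controls look identical when the variant has no effect.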
Standard Association Study
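• Test statistic: $S_i \sim N(\lambda_i \sqrt{N}, 1)$, where $\lambda_i \sqrt{N}$ is the non-centrality parameter at marker i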
• Power at a single marker: the probability of
detecting an association with N individuals
at p-value (significance) threshold t
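The single-marker power can be sketched numerically. The code below assumes a two-sided test on a statistic distributed as N(ncp, 1), where `ncp` stands for the non-centrality parameter; the slide's own formula is not preserved in this transcript, so this is one standard form rather than necessarily the exact expression shown. The name `power_single` is hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def power_single(t: float, ncp: float) -> float:
    """Two-sided power at significance threshold t.

    Assumes the test statistic is N(ncp, 1) and that the null is
    rejected when |statistic| exceeds the critical value for t.
    """
    if t <= 0.0:
        return 0.0
    z = -_STD.inv_cdf(t / 2.0)  # critical value, z >= 0
    # P(S > z) + P(S < -z) for S ~ N(ncp, 1)
    return _STD.cdf(ncp - z) + _STD.cdf(-ncp - z)


# Power grows with the non-centrality parameter at a fixed threshold.
for ncp in (0.0, 2.0, 4.0):
    print(ncp, power_single(0.05, ncp))
```

When `ncp` is zero the power equals the threshold itself, which matches the definition of a false positive rate.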
Multiple Hypothesis correction
• Fix the false positive rate at each marker so
that the total false positive rate is α
• Bonferroni correction
– ti= α/M
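• Expected power: $\sum_{i=1}^{M} c_i P_s(\alpha/M, \lambda_i \sqrt{N}, N)$,
where $c_i$ is the probability that marker i is causal and $\lambda_i \sqrt{N}$ is its non-centrality parameter
– False positive rate: the probability of rejecting a correct null
hypothesis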
Multi-Threshold Association
• Allow a different threshold ti for each
marker
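• Power: $P(t_1, \ldots, t_M) = \sum_{i=1}^{M} c_i P_s(t_i, \lambda_i \sqrt{N}, N)$,
with adjusted false positive rate $\sum_{i=1}^{M} t_i$
• Goal: set values for $t_i$ to maximize the
power subject to the constraint $\sum_{i=1}^{M} t_i = \alpha$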
Maximizing the Power
• Gradient at each marker will be equal at the
optimal point
• Given a value of gradient, solve for the threshold
at each marker to achieve that gradient
• Do binary search over the gradient until
thresholds sum to α
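The procedure on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: it assumes the two-sided power function for a statistic distributed as N(ncp, 1), for which the derivative with respect to t works out to exp(-ncp²/2)·cosh(ncp·z) with z the critical value, and it assumes every marker has a positive causal probability and a non-zero non-centrality parameter. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def power_single(t, ncp):
    """Two-sided power at threshold t for a statistic ~ N(ncp, 1)."""
    if t <= 0.0:
        return 0.0
    z = -_STD.inv_cdf(t / 2.0)
    return _STD.cdf(ncp - z) + _STD.cdf(-ncp - z)


def threshold_for_gradient(log_g, c, ncp):
    """Threshold t at which c * dP_s/dt equals exp(log_g).

    Uses dP_s/dt = exp(-ncp^2 / 2) * cosh(ncp * z), z = -Phi^{-1}(t / 2).
    Works in log space so large ncp values do not overflow.
    """
    log_h = log_g - math.log(c) + 0.5 * ncp * ncp  # log of required cosh
    if log_h <= 0.0:
        return 1.0  # gradient never drops this low; cap threshold at 1
    if log_h > 30.0:
        acosh_h = log_h + math.log(2.0)  # acosh(h) ~ log(2h) for large h
    else:
        acosh_h = math.acosh(math.exp(log_h))
    z = acosh_h / abs(ncp)
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * Phi(-z)


def optimal_thresholds(alpha, causal_probs, ncps, iters=200):
    """Binary search over the common gradient until thresholds sum to alpha."""
    def total(log_g):
        return sum(threshold_for_gradient(log_g, c, n)
                   for c, n in zip(causal_probs, ncps))

    lo, hi = -50.0, 50.0
    while total(lo) < alpha:   # widen bracket: sum of thresholds too small
        lo -= 50.0
    while total(hi) > alpha:   # widen bracket: sum of thresholds too large
        hi += 50.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > alpha:
            lo = mid           # thresholds too generous: raise the gradient
        else:
            hi = mid
    return [threshold_for_gradient(hi, c, n)
            for c, n in zip(causal_probs, ncps)]


# Toy comparison against a single Bonferroni threshold (illustrative values).
alpha, c, ncps = 0.05, [1 / 3] * 3, [2.0, 3.0, 4.0]
ts = optimal_thresholds(alpha, c, ncps)
multi = sum(ci * power_single(t, n) for ci, t, n in zip(c, ts, ncps))
bonf = sum(ci * power_single(alpha / len(c), n) for ci, n in zip(c, ncps))
print(ts, multi, bonf)
```

Because the power at each marker is concave in its threshold under these assumptions, equal gradients identify the maximum, and the sum of thresholds decreases monotonically as the target gradient rises, which is what makes the binary search valid.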
Maximizing Power for Proxies
• In practice, markers are tags for causal variation
• Given K variants, assign each potential causal
variation vk to the best marker i
• The effective non-centrality parameter is reduced
by a factor of |rki| where rki is the correlation
coefficient between variant k and marker i.
• If $v_k$ is causal, the power function when observing
proxy marker i is $P_s(t_i, |r_{ki}| \lambda_k \sqrt{N}, N)$
Maximizing Power for Proxies
• Each variant k has a probability $c_k$ of being causal
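• The total power captured by each marker i, where $T_i$ is the set of variants assigned to marker i:
$P_m(t_i, T_i, N) = \sum_{v_k \in T_i} c_k P_s(t_i, |r_{ki}| \lambda_k \sqrt{N}, N)$
• The total power of the association study:
$P(t_1, t_2, \ldots, t_M) = \sum_{i=1}^{M} P_m(t_i, T_i, N) = \sum_{i=1}^{M} \sum_{v_k \in T_i} c_k P_s(t_i, |r_{ki}| \lambda_k \sqrt{N}, N)$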
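The proxy calculation can be sketched in a few lines. The code assumes that "best marker" means the marker with the largest absolute correlation to the variant, and it reuses a two-sided power function for a statistic distributed as N(ncp, 1). The names `assign_to_best_marker` and `total_power`, and the layout of the correlation table `r`, are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()


def power_single(t, ncp):
    """Two-sided power at threshold t for a statistic ~ N(ncp, 1)."""
    if t <= 0.0:
        return 0.0
    z = -_STD.inv_cdf(t / 2.0)
    return _STD.cdf(ncp - z) + _STD.cdf(-ncp - z)


def assign_to_best_marker(r):
    """r[k][i] is the correlation between variant k and marker i.

    Returns, for each variant, the index of the marker with the largest
    absolute correlation (assumed meaning of "best marker").
    """
    return [max(range(len(row)), key=lambda i: abs(row[i])) for row in r]


def total_power(thresholds, causal_probs, ncps, r):
    """Sum over variants of c_k * P_s(t_i, |r_ki| * ncp_k), i = best marker."""
    best = assign_to_best_marker(r)
    return sum(
        c * power_single(thresholds[i], abs(r[k][i]) * ncp)
        for k, (c, ncp, i) in enumerate(zip(causal_probs, ncps, best))
    )


# Two variants, two markers; the second variant is only partially tagged.
r = [[1.0, 0.0],
     [0.2, -0.8]]
print(assign_to_best_marker(r))
print(total_power([0.01, 0.01], [0.5, 0.5], [3.0, 3.0], r))
```

Imperfect tagging shrinks the effective non-centrality parameter, so the second variant in this toy example contributes less power than it would if it were genotyped directly.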
Candidate Gene study
• 1000 cases and controls over ENCODE regions using
markers in Affymetrix 500k genechip
Robustness over relative risks
Whole Genome Association
• Assumption
– Each SNP is equally likely to be causal with
relative risk of 2
• Power of the traditional study vs. multi-threshold association over 2,614,057 SNPs (traditional / multi-threshold)
– Average: 0.593 / 0.610
– Average over SNPs with power in [0.1, 0.9]: 0.568 / 0.615
Impact of extrinsic information
1. cSNPs are more likely to be involved in disease
2. Add information on sets of genes which are more
likely to be involved in a specific disease
• 30,700 cSNPs in HapMap contribute to 20% of
the disease-causing variation
• Cancer Gene Census: 363 genes in which
mutations have been implicated in cancer. 20%
of causal variation is assumed to lie in these genes
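One way to turn such extrinsic information into per-marker causal probabilities is sketched below. It assumes that a fixed share of the total causal probability (20% here) is spread evenly over the annotated SNPs and the remainder evenly over all other SNPs; the talk does not spell out the exact allocation, so this is an illustrative assumption, and the function name `prior_causal_probs` is hypothetical.

```python
def prior_causal_probs(is_annotated, annotated_mass=0.2):
    """Per-SNP causal probabilities from a binary annotation.

    is_annotated   : list of booleans, True for SNPs in the favored class
                     (for example cSNPs or SNPs in candidate genes)
    annotated_mass : share of causal probability assigned to that class
                     (assumed to be split evenly within each class)
    """
    n_in = sum(1 for flag in is_annotated if flag)
    n_out = len(is_annotated) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("need at least one SNP inside and outside the class")
    p_in = annotated_mass / n_in
    p_out = (1.0 - annotated_mass) / n_out
    return [p_in if flag else p_out for flag in is_annotated]


# Toy example: 2 annotated SNPs out of 5 share 20% of the probability.
priors = prior_causal_probs([True, False, False, False, True])
print(priors, sum(priors))
```

These probabilities play the role of the $c_i$ values used when computing expected power, so markers in the favored class receive more lenient thresholds only when the class is small relative to its share of the probability.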