SNP Aggregation Analysis

Download Report

Transcript SNP Aggregation Analysis

1
A Kernel Machine Approach
for Detecting Associations with
Rare Copy Number Variants
Jung-Ying Tzeng
Bioinformatics Research Center &
Department of Statistics
NC State University
Department of Statisitcs, National Cheng Kung University
December 25, 2014
Acknowldegement
• Joint work with Patrik Magnusson, Patrick Sullivan,
Jin Szatkiewicz and the Swedish Schizophrenia
Consortium
• Many of the slides are obtained from
– Dr. Debashis Ghosh at Colorado School of Public
Health
– Dr. Michael Wu at Fred Hutchinson Cancer
Research Center
3
A Kernel Machine Approach
for Detecting Associations
with Rare Copy Number Variants
Genetic Association Studies
Genetic Association Analysis
• Goal: Identification of individual genes or variants associated with
phenotypes, e.g., diseases
• How?
ACCAATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCAATCGGCTCAATTCCGATCTAATTTCATACATACCAATTTGACCCAA
controls
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCAATTTGACCCAA
ACCAATCGGATCAATTCCAATCTAATTTCATACATACCAATTTAACCCAA
ACCAATCGGATCAATTCCAATCCAATTCCATACATACCAATTTAACCCAA
cases
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCGATTTGACCCAA
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTGACCCAA
ACCAATCGGCTCAATTCCGATCCAATTCCATATATACCGATTTGACCCAA
Cases
Controls
Genetic
Contents
Source: MW
Genetic Association Analysis
• Goal: Identification of individual genes or variants associated with
phenotypes, e.g., diseases
• How?
ACCAATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCAATCGGCTCAATTCCGATCTAATTTCATACATACCAATTTGACCCAA
controls
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCAATTTGACCCAA
ACCAATCGGATCAATTCCAATCTAATTTCATACATACCAATTTAACCCAA
ACCAATCGGATCAATTCCAATCCAATTCCATACATACCAATTTAACCCAA
cases
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCGATTTGACCCAA
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTGACCCAA
ACCAATCGGCTCAATTCCGATCCAATTCCATATATACCGATTTGACCCAA
Cases
Controls
Genetic
Contents
Source: MW
Genetic Association Analysis
• Goal: Identification of individual genes or variants associated with
phenotypes, e.g., diseases
• How?
ACCAATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCAATCGGCTCAATTCCGATCTAATTTCATACATACCAATTTGACCCAA
controls
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCAATTTGACCCAA
ACCAATCGGATCAATTCCAATCTAATTTCATACATACCAATTTAACCCAA
ACCAATCGGATCAATTCCAATCCAATTCCATACATACCAATTTAACCCAA
cases
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCGATTTGACCCAA
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTGACCCAA
ACCAATCGGCTCAATTCCGATCCAATTCCATATATACCGATTTGACCCAA
Cases
Controls
Genetic
Contents
Source: MW
Genetic Association Analysis
• Goal: Identification of individual genes or variants associated with
phenotypes, e.g., diseases
• How?
ACCAATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCAATCGGCTCAATTCCGATCTAATTTCATACATACCAATTTGACCCAA
controls
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCAATTTGACCCAA
ACCAATCGGATCAATTCCAATCTAATTTCATACATACCAATTTAACCCAA
ACCAATCGGATCAATTCCAATCCAATTCCATACATACCAATTTAACCCAA
cases
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCGATTTGACCCAA
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTGACCCAA
ACCAATCGGCTCAATTCCGATCCAATTCCATATATACCGATTTGACCCAA
Cases
Controls
Genetic
Contents
Source: MW
Genetic Association Analysis
• Measure of genetic contents
– Not every base exhibits variations. Instead of studying the
entire DNA sequences, only focus on a subset of them, i.e.,
genetic markers
– A genetic marker is a portion of a DNA molecule where exist
variations among individuals
ACCAATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCAATCGGCTCAATTCCGATCTAATTTCATACATACCAATTTGACCCAA
controls
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTAACCCAA
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCAATTTGACCCAA
ACCAATCGGATCAATTCCAATCTAATTTCATACATACCAATTTAACCCAA
ACCAATCGGATCAATTCCAATCCAATTCCATACATACCAATTTAACCCAA
cases
ACCGATCGGCTCAATTCCAATCCAATTCCATACATACCGATTTGACCCAA
ACCGATCGGATCAATTCCGATCCAATTCCATACATACCGATTTGACCCAA
ACCAATCGGCTCAATTCCGATCCAATTCCATATATACCGATTTGACCCAA
Source: MW
Types of Genetic Variations
• Measure of genetic contents
– Not every base exhibits variations. Instead of studying the
entire DNA sequences, only focus on a subset of them, i.e.,
genetic markers
– A genetic marker is a portion of a DNA molecule where exist
variations among individuals
Figure Source: Figure 9.1 of Schork et al. 2008
Types of Genetic Variations
Mutation at single
position (nucleotide)
=Single Nucleotide
Polymorphism
(SNP)
Changes in the number of DNA
copies
= Copy Number Variants
(CNV)
11
A Kernel Machine Approach
for Detecting Associations
with rare Copy Number Variants
Copy Number Variants
12
Copy Number Variants (CNVs)
• CNVs : changes in the number of DNA copies comparing to the
reference
(Source: Ferreira and Purcell 2009)
Duplication
...CG
1bp - Mb
...CG
Deletion
...CG ATG...
ATG...
ATG...
...GTGGGG...
...TTGAA...
• Although SNPs outnumber CNVs, their relative contributions to genomic
variation (as measured in nucleotides) are similar (Malhotra and Sebat 2012)
SNPs
CNVs
Estimate
~1 in 1000 b
> 1000 CNVs
Base pairs
~4 Mb
~4 Mb
% genome
~0.1%
~0.1%
Mutation rate
10-8
10-4 to 10-6
13
Copy Number Variants (CNVs)
• CNVs alter the “dosage” of genes in the deleted or duplicated
region, which may affect disease risk
Ex. CNVs play an important role in the etiology of multiple
psychiatric disorders, e.g., developmental delay, autism,
schizophrenia
Malhotra
and Sebat
2012
Malhotra
and Sebat
2012
16
Aggregation Analysis for Rare CNVs
• Due to rarity of the events (e.g., 1 out of 5000 people), rare
variants are analyzed on the unit of “variant set” instead of one
by one
 This is referred to as “aggregation analysis”
• Aggregation analysis of all rare CNVs serves as a key approach
to evaluate the collective effect of rare CNVs (Sullivan et al. 2012;
Collins and Sullivan 2013; Malhotra & Sebat 2012)
• CNVs are typically collapsed across the genome
– Ex. a greater genome-wide burden of rare CNVs in SCZ
cases than in controls (Walsh et al. 2008 Science; International
Schizophrenia Consortium 2008 Nature; Kirov et al. 2009 Hum. Mol.
Genet; Buizer-Voskamp et al. 2011 Biol. Psychiatry)
or within pathways or genes
– Ex. the burden of rare CNVs in NRXN1 was significantly
greater in SCZ cases than in controls (Szatkiewicz et al. 2014)
17
Developments in SNP Aggregation Analysis
• Many methods have been developed for performing
aggregation analysis for SNPs
• Depending on how genotype effects are modeled, SNP
aggregation methods can be roughly classified into
1. Burden-based approaches
(Fixed effects approaches)
2. Variance-component approaches
(Random effects approaches)
18
Method Review of
SNP Aggregation Analysis
19
SNP Aggregation Analysis
Mutation Burden
1. Burden-based approaches
(Fixed effects approaches)
𝛽𝑚
• Regress traits on weighted genotype sum of all loci
g  EYi   X i   *  w1G1i   wM GMi ,
where Gmi  0,1, 2
Test H 0 :  *  0 for association
• Focus on testing mean level of genetic effects
• Methods differ by the choices of weights, e.g.,
− Minor allele frequencies (MAF) (e.g., Madsen and Browning 2009)
− Functionality (Price et al. 2010)
− Estimated effect size (Lin and Tang 2011)
• Optimal if the effects of different loci are additive, have similar
size and same direction
20
SNP Aggregation Analysis
2. Variance component based approaches
(Random effects approaches)
• Focus on testing variance level of genetic effects
• Basic Idea :
g  EYi   X i  1G1i 
𝛽𝑚
  M GMi , where Gmi  0,1, 2
Assume  ~ N  0,    
Test H 0 :   0 for association
• In general,
g  EYi   X i  gi ,
where g  [ g1 ,
, g n ]T ~ N (0, K nn )
 𝐾 = {𝑘𝑖𝑗 }, 𝑘𝑖𝑗 = genetic similarity between 𝑖 and 𝑗
• Optimal if effects of different loci are interactive / non-linear or
very different (e.g., zero effect or opposite direction)
21
SNP Aggregation Analysis
2. Variance component based approaches --- a kernel machine
perspective
• Consider a more general model for association
𝑔 𝐸𝑌𝑖 = 𝑋𝑖 𝛾 + ℎ 𝐺1𝑖 , ⋯ , 𝐺𝑀𝑖 ≡ 𝑋𝑖 𝛾 + ℎ 𝑮𝒊
• To gain flexibility in modeling genetic effects, we specify ℎ 𝑮𝒊 using a
certain kernel function 𝑘 ⋅,⋅ to capture linear/nonlinear additive/nonadditive effects
• 𝑘 ⋅,⋅ projects 𝑮𝒊 from its original space to a new space that 𝑘 ⋅,⋅
implicitly specifies, and in this new space, ℎ(𝑮𝒊 ) is modeled linearly,
i.e., ℎ 𝑮𝒊 = 𝑗 𝛼𝑗 𝑘 𝑮𝒊 , 𝑮𝒋 and 𝛼’s are unknown parameters
• This is termed as kernel machine model (Lin et al. 2007, Liu et al.
2008)
• 𝑘 𝑮𝒊 , 𝑮𝒋 captures the genetic similarities between individuals 𝑖 and 𝑗
22
SNP Aggregation Analysis
2. Variance component based approaches --- a kernel machine
perspective (continued)
• Kernel Machine (KM) model:
𝑔 𝐸𝑌𝑖 = 𝑋𝑖 𝛾 + ℎ 𝑮𝒊 , where ℎ 𝑮𝒊 =
𝑗
𝛼𝑗 𝑘 𝑮𝒊 , 𝑮𝒋
• In matrix format:
𝑔 𝐸𝑌 = 𝑋𝛾 + ℎ(𝑮), where ℎ 𝑮 = 𝑲𝛼,
where 𝛼 𝑇 = (𝛼1 , ⋯ 𝛼𝑛 ) and 𝑲𝑛×𝑛 = { 𝑘 𝑮𝒊 , 𝑮𝒋 }
• The kernel machine estimators of 𝛾 and ℎ are identical to the BLUP
estimators under GLMM (Lin et al. 2007, Liu et al. 2008)
𝑔 𝐸𝑌 = 𝑋𝛾 + g with g ∼ 𝑁(0, 𝜏𝑲)
(or 𝑔 𝐸𝑌 = 𝑋𝛾 + 𝛼 with 𝛼 ∼ 𝑁(0, 𝜏𝑲−𝟏 ) )
23
SNP Aggregation Analysis
2. Random effects approaches
• Methods differ by the choices of weights and 𝑘𝑖𝑗
E.g., Global test (Goeman et al. 2004)
𝑘𝑖𝑗 = 𝐺𝑖𝑇 𝐺𝑗 and no weights
C-alpha method (Neale et al. 2011)
𝑘𝑖𝑗 = 𝐺𝑖𝑇 𝐺𝑗 and with weight = I{MAF < cut}
Kernel machine regression (Kwee et al. 2008, Wu et al. 2010, 2011)
𝑘𝑖𝑗 = IBS at locus 𝑚 between 𝑖 and 𝑗 and weight = (1-MAF)24
Similarity Regression (Tzeng et al. 2009, 2011, Wang et al. 2014;
Zhao et al. 2014)
𝑘𝑖𝑗 = IBS at locus 𝑚 between 𝑖 and 𝑗 and weight = 𝑀𝐴𝐹 −1
24
Challenges in CNV Aggregation Analysis
--- Cautions on applying SNP aggregation methods
1.
Copy number (dosage) is not binary
– Deletion (0,1), normal copy (2) and duplication (3,4+)
– SNP collapsing methods assume binary event (i.e.,
mutant allele vs. not) and only keep track of number of
“events”
2.
CNV polymorphisms are multi-faceted
– CNVs can vary in dosage, length and details of gene
intersections
25
Multi-faceted Nature of CNVs
Kirov et al 2009
26
Challenges in CNV Aggregation Analysis
--- Cautions on applying SNP aggregation methods
1.
Copy number (dosage) is not binary
– Deletion (0,1), normal copy (2) and duplication (3,4+)
– SNP collapsing methods assume binary event (i.e.,
mutant allele vs. not) and only keep track of number of
“events”
2.
CNV polymorphisms are multi-faceted
– CNVs can vary in dosage, length and details of gene
intersections
– Each of these ”features” affects CNVs’ impact on
disease risk.
– SNP collapsing methods target only on one feature (i.e.,
mutation burden).
27
Challenges in CNV Aggregation Analysis
--- Cautions on applying SNP aggregation methods
3. Etiological heterogeneity is often observed in CNVs
– Different dosage may have different effects
Ex. 22q11.2 deletion is a risk factor for SCZ (Bassett et al. 2005;
Murphy et al. 1999) whereas 22q11.2 duplication is a protective
factor (Rees et al. 2014)
Ex. In gene VIPR2, triplication has higher risk than duplication for
SCZ (Vacic et al. 2011)
– Collapsing with random effects methods have greater
potentials than fixed effect methods for CNV analyses (for
between-locus heterogeneity)
– Cautions are still needed for within-locus heterogeneity
28
Current CNV Aggregation Methods
(All are fixed effects methods)
1. PLINK Burden Tests (International Schizophrenia Consortium 2008; Kirov et
al. 2009)
– Dichotomize CNV genotypes based on the event of interests, e.g.,
• CNV (≠ 2) vs. no CNVs (= 2)
• Del (<2) vs. No Del
• Dup (>2) vs. No Dup
• Genes intersected (GI) by CNVs vs. no GI
– Compare the event rates between cases and controls
– Drawbacks:
• Need to dichotomize data based on event of interests
• Do not address the issue of etiological heterogeneity
• Only evaluate marginal effects of a CNV feature, which
subjects to spurious association (Raychaudhuri et al. 2010)
Under 𝑯𝟎 :no GI effect (Raychaudhuri et al. 2010)
Scenario
S0
S1
S2
S3
S4
Cases
CNV rate
Mean size (kb)
0.25
0.25
0.25
0.25
0.25
100
100
60
100
60
Controls
CNV rate
Mean size (kb)
0.25
0.05
0.25
0.05
0.05
100
100
100
60
100
29
30
Current CNV Aggregation Methods
2. PLINK Enrichment Tests (Raychaudhuri et al. 2010)
–
Total # of CNVs
Mean CNV size
(kb)
# of genes intersected
(GI) by a CNV
– Pros: assess conditional effect of CNV features and avoid
spurious association
Under 𝑯𝟎 :no GI effect (Raychaudhuri et al. 2010)
Scenario
S0
S1
S2
S3
S4
Cases
CNV rate
Mean size (kb)
0.25
0.25
0.25
0.25
0.25
100
100
60
100
60
Controls
CNV rate
Mean size (kb)
0.25
0.05
0.25
0.05
0.05
100
100
100
60
100
31
32
Current CNV Aggregation Methods
2. PLINK Enrichment Tests (Raychaudhuri et al. 2010)
–
Total # of CNVs
Mean CNV size
(kb)
# of genes intersected
(GI) by a CNV
– Pros: assess conditional effect of CNV features and avoid
spurious association
– Cons:
• Need to dichotomize data based on event of interests
• Do not address the issue of etiological heterogeneity
33
Proposed CNV Aggregation Strategy
34
Plan
• Use kernel machine (random effects) approaches
– To account for between-locus and within-locus
etiological heterogeneity
• Model multiple features of CNVs
– To assess the conditional effect of a CNV feature
• Accommodate multi-nominal nature of dosage (and GI)
– To avoid dichotomizing data
35
1. Input Data Format
(0) Start with a PLINK format CNV file
(1) Define CNV region (CNVR):
– Clusters of CNV segments with ≥1bp overlap
CNVR1
CNVR2
-------------------------------------------------------------------------------------------------1
2
3
4
5
…
– Retain region-specific effect when collapsing
36
1. Input Data Format
(2) Create design matrix for each CNV feature: dosage, length, and gene
intersection
CNVR𝑴
– Dosage (DS) : CNVR𝟏
𝐷𝑆
𝑧11
𝐷𝑆
𝑧1𝑀
⋯
⋱
⋮
⋮
𝐷𝑆
𝑧𝑛1
– Length (Len):
∈ {0,1,2,3,4}
𝐷𝑆
𝑧𝑖𝑚
⋯
CNVR𝟏
𝐿𝑒𝑛
𝑧11
⋮
𝐿𝑒𝑛
𝑧𝑛1
⋱
𝐷𝑆
𝑧𝑛𝑀
CNVR𝑴
𝐿𝑒𝑛
𝑧1𝑀
⋯
⋱
𝐿𝑒𝑛
𝑧𝑖𝑚
⋯
⋮
⋱
𝐿𝑒𝑛
𝑧𝑛𝑀
length in kb
37
1. Input Data Format
(2) Create design matrix for each CNV feature: dosage, length, and
gene intersection
– Gene intersection (GI) :
Gene𝟏
𝐺𝐼
𝑧11
⋯
⋱
⋮
𝐺𝐼
𝑧𝑖ℓ
𝐺𝐼
𝑧𝑛1
⋯
Gene 𝑳
𝐺𝐼
𝑧1𝐿
⋮
⋱
𝐺𝐼
𝑧𝑛𝐿
∈
0 ∶ no GI
1 ∶ GI by a Del
2 ∶ GI by a Dup
38
2. Model
• For subject 𝑖,
𝑌𝑖 be the continuous or binary trait,
𝑋𝑖 be a 𝑝 × 1 covariate vector including the intercept, and
𝑓
𝑍𝑖
=
𝑇
𝑓
𝑓
𝑍𝑖1 , ⋯ , 𝑍𝑖𝑀𝑓
= design vector of feature 𝑓 , 𝑓 ∈ {𝐷𝑆, 𝐺𝐼, 𝐿𝑒𝑛}
• Model
𝑔(𝜇𝑖 ) = 𝛾0 𝑋𝑖 + ℎ𝐷𝑆 𝑍𝑖𝐷𝑆 + ℎ𝐺𝐼 𝑍𝑖𝐺𝐼 + ℎ𝐿𝑒𝑛 𝑍𝑖𝐿𝑒𝑛
• Assume 𝑌𝑖 ∼ exponential family with density
𝑓𝑌 𝑌𝑖 ; 𝜃𝑖 , 𝜙 = exp 𝜃𝑖 𝑌𝑖 − 𝑏(𝜃) / 𝜙𝑚𝑖 + 𝑐 𝑦𝑖 , 𝜙
where 𝜃𝑖 = 𝑔 𝜇𝑖 and
ℎ𝑓 ⋅ models the effect of CNV feature 𝑓
39
2. Model
•
𝑓
ℎ𝑓 𝑍𝑖
=
𝑓
𝑓
𝑗 𝛼𝑗 𝑘(𝑍𝑖 , 𝑍𝑗 ), where 𝛼𝑗 ’s are unknown parameters
• Use polynomial kernel:
𝑓 𝑓
𝑘(𝑍𝑖 , 𝑍𝑗 )
= 1+
𝑀𝑓
𝑚=1 𝑤𝑚
×
𝑓
𝑍𝑖𝑚
×
𝑓
𝑍𝑗𝑚
𝑑
– 𝑑 = 1 (linear kernel), only model linear main effect
– 𝑑 = 2 (quadratic kernel), model the linear and quadratic main
effects as well as two-way interactions
40
2. Model
• 𝑔(𝐸𝑌) = 𝑋𝛾 + ℎ𝐷𝑆 𝑍 𝐷𝑆 + ℎ𝐺𝐼 𝑍 𝐺𝐼 + ℎ𝐿𝑒𝑛 𝑍 𝐿𝑒𝑛
• “Fully” kernel machine model (i.e., using kernel functions for all
features) runs very slow with large 𝑛 (𝑒. 𝑔. , > 103 )
– The equivalent GLMM representation:
𝑔 𝜇𝑖 = 𝑋𝛾 + ℎ𝐷𝑆 + ℎ𝐺𝐼 + ℎ𝐿𝑒𝑛
𝑓
𝑓
where ℎ𝑓 ∼ 𝑁(0, 𝜏𝑓 𝐾𝑓 ) and 𝐾𝑓 = 𝑘(𝑍𝑖 , 𝑍𝑗 )
– Under 𝐻0 : ℎ𝐷𝑆 = 0, there are two nuisance variance
components to estimate
• To speed up the computation, propose to model the covariates
and background CNV features using fixed effects (burden
summary) and model the CNV feature of interests using kernel
machine (random effects)
41
Example: Assessing Dosage Effect
• Generalized linear mixed model:
𝑔(𝜇1𝑖 ) = 𝛾1 𝑋𝑖 + 𝛽𝐿𝑒𝑛 𝑍𝑖𝐿𝑒𝑛 + 𝛽𝐺𝐼 𝑍𝑖𝐺𝐼 +
||
ℎ𝐿𝑒𝑛 (𝑍𝑖𝐿𝑒𝑛 )
||
ℎ𝐺𝐼 𝑍𝑖𝐺𝐼
ℎ𝑖𝐷𝑆
||
ℎ𝐷𝑆 (𝑍𝑖𝐷𝑆 )
– ℎ𝑖𝐷𝑆 = subject specific DS effect;
ℎ𝐷𝑆
=
𝑇
𝐷𝑆
𝐷𝑆
ℎ1 , ⋯ , ℎ𝑛
∼ 𝑁(0, 𝜏𝐷𝑆 𝐾𝐷𝑆 )
where 𝐾𝐷𝑆 = 𝑛 × 𝑛 matrix with
𝐾𝐷𝑆 [𝑖, 𝑗] = similarity between 𝑍𝑖𝐷𝑆 𝑎𝑛𝑑 𝑍𝑗𝐷𝑆 ≡ 𝑘𝐷𝑆 𝑍𝑖𝐷𝑆 , 𝑍𝑗𝐷𝑆
– 𝑍𝑖𝑙𝑒𝑛 = mean CNV length in kb
– 𝑍𝑖𝐺𝐼 = the number of genes intersected =
– Dosage effect can be evaluated by testing
1
2
𝐿
𝐺𝐼
𝐼{𝑍
ℓ=1
𝑖ℓ ≠ 0}
𝐻0𝐷𝑆 : 𝜏𝐷𝑆 = 0
– Test statistic: 𝑇𝐷𝑆 = (𝑌 − 𝜇1 )𝑇 Δ1 𝑊1 𝐾𝐷𝑆 𝑊1 Δ1 (𝑌 −
42
Remark: Quantifying Similarity b/w 𝑍𝑖𝐷𝑆 𝐚𝐧𝐝 𝑍𝑗𝐷𝑆
Use the 𝑑-th order polynomial function
𝑘𝐷𝑆 𝑍𝑖𝐷𝑆 , 𝑍𝑗𝐷𝑆
𝑀
=
1+
𝑚=1
𝑑
𝐷𝑆
𝐷𝑆
𝑤𝑚 × 𝑍𝑖𝑚
× 𝑍𝑗𝑚
𝑤𝑚 is the pre-specified weight for locus 𝑚 based on, e.g., MAF
𝐷𝑆
• Cannot directly use 𝑍𝑖𝑚
in the kernel function
𝐷𝑆
𝐷𝑆
(∵ both 𝑍𝑖𝑚
< 2 and 𝑍𝑖𝑚
> 2 are deviated from “normal reference”)
• Solution: factorize dosage
0
0
0
=
0
𝑇
– 𝐺𝑖𝑚
=
𝑇
– 𝐺𝑖𝑚
– Then
1
1
1
0
2
0
2
0
3
0
3
1
𝑘𝐷𝑆 𝑍𝑖𝐷𝑆 , 𝑍𝑗𝐷𝑆
4
0
4
0
= 1+
𝐷𝑆
for 𝑍𝑖𝑚
= 1
𝐷𝑆
for 𝑍𝑖𝑚
=3
𝑀
𝑚=1 𝑤𝑚
𝑇
× 𝐺𝑖𝑚
𝐺𝑗𝑚
dosage-specific effect when collapsing
𝑑
, which retains
43
Simulation Studies
44
Simulation Scheme
• Obtain CNV data from TwinGene Project (Heijmans 2005; Silventoinen et
al. 2006)
– Cross-sectional sampling design
– 2000 unrelated samples (rarest CNV = 5 × 10−4 )
– 1757 CNVRs
– 688 genes (69 genes intersected by CNVs)
– Sample with replacement to form an individual’s CNV
– Determine 𝑌𝑖 (0 𝑜𝑟 1) based on CNV features of interests
– Simulate 𝑛 = 2000 individuals (1000 cases and 1000 controls)
45
Simulation Scheme
Scheme A. Different dosage effects of Dup and Del
A1. Between-locus heterogeneity
– Randomly select 300 Dup-only CNVRs and 300 Del-only CNVRs
as causal loci
A2. Within-locus heterogeneity
– Select the 38 CNVRs with both Dup and Del as causal
Scheme B. Different gene-Intersection effect of Dup and Del (i.e.,
heterogeneous effect of genes intersected by Dup and by Del)
B1. Across-gene heterogeneity
– Randomly select 26 genes with Dup intersection on only and 26
genes with Del intersection only as causal
B2. Within-gene heterogeneity
– Select the 8 genes with both Dup and Del intersection as causal
46
Type I Error for (A) Dosage Analysis
• Compare the proposed GLMM methods with
plink.all = PLINK CNV rates
plink.dup = PLINK Duplication rates
plink.del = PLINK Deletion rates
• Type I error rates (2000 replicates):
Model
GLMM
Plink.all
Plink.dup
Plink.del
Between-Locus
Heterogeneity
0.035
0.046
0.057
0.041
Within-Locus
Heterogeneity
0.041
0.051
0.057
0.043
Power (vs. plink 2 sided)
A1. (Dosage effect) Between-Locus Heterogeneity
All Dup causal are risky
50% Dup causal (Del causal) are
All Del causal are protective
harmful and 50% are protective
All Dup causal are protective
All Del causal are risky
No Heterogeneity
47
Power (vs. plink 2 sided)
A1. (Dosage effect) Between-Locus Heterogeneity
All Dup causal are risky
50% Dup causal (Del causal) are
All Del causal are protective
harmful and 50% are protective
48
Dup causal, 𝟏𝟎𝟎% risky, 0% protective
Del causal,
𝟎 % risky,100% protective
All Dup causal are protective
All Del causal are risky
No Heterogeneity
Dup causal, 𝟎% risky, 100% protective
Del causal, 𝟏𝟎𝟎 % risky,0% protective
A1. (Dosage effect) Between-Locus Heterogeneity
Power (vs. plink 2 sided)
49
Dup causal, 𝟕𝟎% risky
Del causal, 𝟑𝟎% risky
Dup causal, 𝟓𝟎% risky
Del causal, 𝟓𝟎 % risky
Dup causal, 𝟑𝟎% risky
Del causal, 𝟕𝟎 % risky
Power (vs. plink 2 sided)
A2. (Dosage effect) Within-Locus Heterogeneity
50
51
Simulation Scheme
Scheme A. Different dosage effects of Dup and Del
A1. Between locus heterogeneity
– Randomly select 300 Dup-only CNVRs and 300 Del-only CNVRs
as causal
A2. Within locus heterogeneity
– Select the 38 CNVRs with both Dup and Del as causal
Scheme B. Different gene-intersection effect of Dup and Del
B1. Across-gene heterogeneity
– Randomly select 26 genes with Dup intersection only and 26
genes with Del intersection only as causal
B2. Within-gene heterogeneity
– Select the 8 genes with both Dup and Del intersection as causal
52
Type I Error for Gene Intersection (GI) Analysis
• Compare the proposed GLMM methods with
PLINK Enrichment test (Raychaudhuri et al. 2010)
• Type I error rates:
Model
GLMM
Plink.enrichment
Between-Locus
Heterogeneity
0.041
0.044
Within-Locus
Heterogeneity
0.043
0.051
B1. (GI effect) Between-Gene Heterogeneity
53
Power (vs. plink.enrichment)
Dup causal : 50% are risky 50% are protectiv
All Dup causal are risky (100%)
All Del causal are protective (0%) Del causal : 50% are risky 50% are protectiv
All Dup causal are proactive(0%)
All Del causal are risky (100%)
No Heterogeneity
Power (vs. plink.enrichment)
B2. (GI effect) Within-Gene Heterogeneity
54
55
Application to Swedish SCZ Study
• Case-control design; 5536 subjects
• Genotyped by Illumina OmniExpress SNP array
• Perform analysis of >100kb CNVs (more heterogeneous) and
>500kb (more homogeneous)
Dosage
All >100kb CNVs
Plink
GLMM
0.89
0.067
56
Application to Swedish SCZ Study
• Case-control design; 5536 subjects
• Genotyped by Illumina OmniExpress SNP array
• Perform analysis of >100kb CNVs (more heterogeneous) and
>500kb (more homogeneous)
Dosage
Plink
GLMM
All >100kb CNVs
0.89
0.067
All >500kb CNVs
0.029
0.055
57
Application to Swedish SCZ Study
• Case-control design; 5536 subjects
• Genotyped by Illumina OmniExpress SNP array
• Perform analysis of >100kb CNVs (more heterogeneous) and
>500kb (more homogeneous)
Dosage
GI
Plink
GLMM
Plink
GLMM
All >100kb CNVs
0.89
0.067
0.045
0.030
All >500kb CNVs
0.029
0.055
0.004
0.018
58
Summary
For CNV collapsing analysis:
• Developments in SNP collapsing can be applied in CNV collapsing
with modification to account for the nature of CNVs,
e.g., defining “locus” using CNVR or gene,
calculating similarity based on factorized dosage / GI details,
adjusting for background CNV features
• KM (random effects) modeling has more potential to address
etiological heterogeneity
– KM test has robustness across different scenarios
– For GI, KM test is more powerful than plink.enrichment
• Note that KM test has the same model as plink.enrichment
except that GI effect is modeled using random effects with
factorized coding
59
Summary
• Extensions:
– Fast KM (a low-rank approximation method) to enhance
computational efficiency for the “fully” KM methods
– Adaptive test is needed to gain robustness and power,
consider a hybrid approach, such as the SKAT-O test for
rare SNP analysis (Lee et al., 2012)
Thank you