Association Tests for Rare Variants Using
Sequence Data
Guimin Gao, Wenan Chen, & Xi Gao
Department of Biostatistics, VCU
Introduction to Association tests: two
hypotheses
Common variant-common disease
Common variant: Minor allele frequencies (MAF) >= 5%
Using linkage disequilibrium (LD)
Rare variant-common disease
Rare variant: MAF < 1% (or 5%)
High allelic heterogeneity: disease caused collectively by multiple rare
variants with moderate to high penetrance
Associations through LD would not be suitable
Association tests for Common variants
Test a single marker each time
Cochran-Armitage trend test (CATT), assuming an additive (ADD) model
Power: high under additive (ADD) or multiplicative (MUL) models; low under
recessive (REC) or dominant (DOM) models
Genotype association test (GAT) using chi-square statistic
Power: a little lower for ADD, higher for REC
MAX3 = maximum of three trend test statistics across the
REC, ADD, and DOM models (Freidlin et al. 2002 Hum Hered.)
Power: lower than CATT under ADD
higher than CATT & GAT under REC
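A minimal sketch of the single-marker trend test and the MAX3 statistic, assuming genotype counts are given as (0, 1, 2 copies of the minor allele) for cases and controls. The function names `catt` and `max3` and the score vectors (0,0,1), (0,1,2), (0,1,1) for REC, ADD, DOM are illustrative choices, not taken from the slides. The p-value uses the normal approximation; MAX3 itself needs simulation or permutation for a p-value, so only the statistic is returned.

```python
import math

def catt(case_counts, control_counts, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for one SNP.

    case_counts / control_counts: numbers of individuals with 0, 1, 2
    copies of the minor allele. Returns (z, two-sided normal p-value).
    """
    r, s = sum(case_counts), sum(control_counts)
    n = r + s
    totals = [a + b for a, b in zip(case_counts, control_counts)]
    # Trend statistic U = sum_i t_i * (s * r_i - r * s_i)
    u = sum(t * (s * a - r * b)
            for t, a, b in zip(scores, case_counts, control_counts))
    # Null variance of U
    st = sum(t * c for t, c in zip(scores, totals))
    stt = sum(t * t * c for t, c in zip(scores, totals))
    var = r * s / n * (n * stt - st * st)
    z = u / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))

def max3(case_counts, control_counts):
    """Largest absolute trend statistic over the REC, ADD and DOM scores."""
    models = {"REC": (0, 0, 1), "ADD": (0, 1, 2), "DOM": (0, 1, 1)}
    return max(abs(catt(case_counts, control_counts, t)[0])
               for t in models.values())

z, p = catt((10, 20, 30), (30, 20, 10))
print(z, p, max3((10, 20, 30), (30, 20, 10)))
```

The sketch does not guard against a monomorphic SNP (zero variance), which would need handling in real data.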
Association tests for Common variants
Test for single marker (CATT, GAT, & MAX3)
Low power when MAF <10%
No power for rare variants with MAF<1%
Multivariate test
Considering a group of variants (ex. SNPs in a gene) each time
Multiple logistic regression (or Hotelling test, Fisher’s product)
$$ \operatorname{logit} \Pr(y_i = 1) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} $$
y_i is the disease status of indiv i
x_ij = 0, 1, 2: the count of the minor allele of indiv i at locus j
Power: higher than single-marker test;
still very low due to large d.f. = No. of SNPs = k
Need new methods for rare variants
Collapsing SNPs into a single marker to reduce d.f.
Outline
Introduction to association tests
Three well-known collapsing methods for rare
variants: CAST, CMC, & Weighted Sum methods
An evaluation using GAW 17 data
Extension to the three collapsing methods
Future research
Three association tests for Rare variants
Collapsing a set of rare variants (into a single marker)
A cohort allelic sums test (CAST) (Morgenthaler & Thilly
2007, Mutat. Res.)
Combined Multivariate and Collapsing (CMC) (Li &
Leal, 2008, AJHG)
Division into subgroups, collapsing in each subgroup
Weighted Sum statistic (Madsen & Browning, 2009, PLoS
Genet.; Price et al., 2010, AJHG)
A cohort allelic sums test (CAST)
A group of n variants (SNPs) in a unit (ex. one gene, LD
block)
Collapsing the genotypes across the variants
Indicator coding for individual j
xj = 1, if rare alleles present at any of the n variants;
xj = 0, otherwise
Testing if the proportions of individuals with rare variants (xj
= 1) in cases and controls differ
Higher power than methods that test a single variant at a time
Only for rare variants
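The collapsing step and the comparison of carrier proportions can be sketched as follows. This is only an illustration: `cast_collapse` and `cast_test` are hypothetical names, genotypes are assumed to be stored as one list of minor-allele counts per individual, and the two-proportion z-test is one possible choice for comparing cases and controls (Fisher's exact test would be a natural alternative for small counts).

```python
import math

def cast_collapse(genotypes):
    """genotypes[j][i]: minor-allele count (0/1/2) of indiv j at variant i.
    Returns x_j = 1 if indiv j carries a rare allele at any variant, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def cast_test(genotypes, is_case):
    """Compare the proportion of carriers (x_j = 1) in cases and controls."""
    x = cast_collapse(genotypes)
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    a = sum(xj for xj, c in zip(x, is_case) if c)       # carriers among cases
    b = sum(xj for xj, c in zip(x, is_case) if not c)   # carriers among controls
    pooled = (a + b) / (n_case + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_case + 1 / n_ctrl))
    z = (a / n_case - b / n_ctrl) / se
    return z, math.erfc(abs(z) / math.sqrt(2))          # two-sided p-value

# Toy data: 20 cases (10 carriers) followed by 20 controls (2 carriers)
geno = [[0, 0, 0] for _ in range(40)]
for k in range(10):
    geno[k][k % 3] = 1
geno[20][0] = 1
geno[21][2] = 2
status = [1] * 20 + [0] * 20
print(cast_test(geno, status))
```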
Combined Multivariate and Collapsing (CMC)
Method (Li & Leal 08)
Consider SNPs in a unit with MAF< a threshold (0.01 or
0.05)
Division and Collapsing
Divided into several sub-groups based on the MAF
Ex. Subgroups: (0, 0.001), [0.001, 0.005), [0.005, 0.01)
SNPs are collapsed in each sub-group
xij = 1, if indiv j has rare alleles present in the i-th subgroup;
xij = 0, otherwise
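The division-and-collapsing step can be sketched as below. The function name `cmc_collapse`, the data layout, and the half-open bin convention [low, high) are assumptions made for illustration. Note that the result is indexed as x[j][g] (individual first, subgroup second), the transpose of the x_ij notation on the slide.

```python
def cmc_collapse(genotypes, maf, bins):
    """Collapse SNPs within MAF subgroups.

    genotypes[j][i]: minor-allele count (0/1/2) of indiv j at SNP i
    maf[i]:          minor allele frequency of SNP i
    bins:            list of (low, high) pairs, each read as [low, high)
    Returns x with x[j][g] = 1 if indiv j carries a minor allele at any
    SNP of subgroup g, else 0. SNPs outside every bin are ignored.
    """
    groups = [[i for i, q in enumerate(maf) if low <= q < high]
              for low, high in bins]
    return [[1 if any(row[i] > 0 for i in idx) else 0 for idx in groups]
            for row in genotypes]

maf = [0.0005, 0.002, 0.004, 0.008, 0.2]
bins = [(0.0, 0.001), (0.001, 0.005), (0.005, 0.01)]
geno = [[1, 0, 0, 0, 2],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 0, 1]]
print(cmc_collapse(geno, maf, bins))
```

The collapsed indicators would then be passed to a multivariate test, which is not shown here.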
Combined Multivariate and Collapsing (CMC)
Method (Li & Leal 08)
Multivariate test of collapsed sub-groups
Hotelling T2 test, MANOVA, Fisher’s product method
Power: often higher than CAST
Different thresholds may give different power
Weighted Sum Method (Madsen & Browning 09)
A group of variants (SNPs) in a unit
A weight for SNP i: the estimated std. dev. of the No. of minor alleles in the
sample
$$ w_i = \sqrt{n_i q_i (1 - q_i)} $$
q_i is the minor allele freq in controls; n_i is the No. of indivs genotyped at SNP i
Calculate a weighted genetic score for indiv j
$$ v_j = \sum_{i=1}^{L} I_{ij} / w_i $$
I_ij = 0, 1, 2: the count of the minor allele of indiv j at locus i
Obtain rank(v_j); sum the ranks of the affected indivs (set A)
$$ x = \sum_{j \in A} \operatorname{rank}(v_j) $$
Permutation for p-value estimation
From observed data:
$$ x = \sum_{j \in A} \operatorname{rank}(v_j) $$
Permutation to estimate p-value:
Phenotype labels are permuted 1000 times, giving x_1, …, x_1000
Calculate the mean (μ) and standard deviation (σ) of the 1000 x's
$$ z = (x - \mu) / \sigma $$
Assume z ~ N(0, 1) under null hypothesis
Obtain the p-value from N(0, 1)
Fast, p-value ~U[0,1]
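The weighted-sum procedure above can be sketched end to end as follows. This is a hedged illustration, not the authors' code: all function names are hypothetical, q_i is estimated with the add-one adjustment (m_i + 1) / (2 n_controls + 2) so that it is never zero, n_i is taken to be the full sample size (no missing genotypes), ties get average ranks, and a one-sided upper-tail p-value is used on the assumption that an excess of rare alleles in cases is of interest.

```python
import math
import random

def wsm_weights(genotypes, is_case):
    """w_i = sqrt(n * q_i * (1 - q_i)), with q_i estimated in controls."""
    n = len(genotypes)
    controls = [row for row, c in zip(genotypes, is_case) if not c]
    weights = []
    for i in range(len(genotypes[0])):
        m_u = sum(row[i] for row in controls)        # minor alleles in controls
        q = (m_u + 1) / (2 * len(controls) + 2)      # add-one keeps q > 0
        weights.append(math.sqrt(n * q * (1 - q)))
    return weights

def average_ranks(values):
    """Ranks starting at 1, tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def rank_sum(genotypes, is_case):
    """x = sum over affected indivs of rank(v_j), v_j = sum_i I_ij / w_i."""
    w = wsm_weights(genotypes, is_case)
    v = [sum(g / wi for g, wi in zip(row, w)) for row in genotypes]
    return sum(r for r, c in zip(average_ranks(v), is_case) if c)

def wsm_test(genotypes, is_case, n_perm=1000, seed=0):
    """Permute phenotype labels, then standardise the observed rank sum."""
    rng = random.Random(seed)
    x_obs = rank_sum(genotypes, is_case)
    labels = list(is_case)
    perm = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        perm.append(rank_sum(genotypes, labels))     # weights recomputed per permutation
    mu = sum(perm) / n_perm
    sigma = math.sqrt(sum((x - mu) ** 2 for x in perm) / (n_perm - 1))
    z = (x_obs - mu) / sigma
    return z, 0.5 * math.erfc(z / math.sqrt(2))      # one-sided p from N(0, 1)
```

Because q_i depends on which individuals are labelled as controls, the weights are recomputed inside every permutation.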
Weighted Sum Method (Madsen & Browning 09)
Power comparison:
Simulations assuming genotypic relative risk is proportional to
MAF at disease loci (Madsen & Browning 09)
Weighted Sum Method (WSM) > CMC > CAST
WSM > CMC may not hold in other situations
Can be applied to rare variants & common variants
Disadvantage:
Gives very high weights to very rare alleles (singletons) and very low
weights to common variants.
An evaluation of the CMC method and Weighted
sum method by using GAW 17 data
Both methods are powerful (based on the authors’ simulation)
Our evaluation based on simulated datasets from GAW 17
GAW 17 data:
a subset of genes with real sequence data available in the 1000
Genomes Project
Simulated phenotypes
Unrelated individuals, families
Dataset of 697 unrelated individuals
24487 SNPs in 3205 genes from 22 autosomal chromosomes
Only test for the 2196 genes with non-synonymous SNPs
GAW 17 dataset of unrelated individuals
Four phenotypes: Q1, Q2, Q4 and disease status.
Q1, Q2, and Q4 are quantitative traits
Q1 associated with 39 SNPs in 9 genes
Q2 associated with 72 SNPs in 13 genes
Q4: not related to any genes
Disease status is a binary trait: affected or unaffected, associated
with 37 genes
200 simulated phenotype replicates
Only one replicate of genotype data (original data)
Transforming Phenotypes
Methods: case-control design
Transform Q1, Q2, Q4 into binary traits
Splitting each distribution at its top 30% (the 70th percentile)
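A small sketch of the transformation, assuming that individuals in the top 30% of a quantitative trait are coded as cases and the rest as controls (the slide does not say which side is treated as affected). The function name and the `top` parameter are illustrative.

```python
def binarize_top_fraction(values, top=0.30):
    """Code the top `top` fraction of trait values as 1 (case), the rest as 0."""
    n_cases = round(top * len(values))
    # Indices sorted from the largest trait value to the smallest
    order = sorted(range(len(values)), key=values.__getitem__, reverse=True)
    labels = [0] * len(values)
    for idx in order[:n_cases]:
        labels[idx] = 1
    return labels

q1 = [0.1, 2.5, 1.0, 3.2, 0.7, 1.9, 0.3, 2.8, 1.1, 0.5]
print(binarize_top_fraction(q1))
```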
Criteria for evaluation of Tests
Familywise error rate (FWER)
2196 genes with non-synonymous SNPs, 2196 tests
2196 null hypotheses H_j0: gene j is not associated with the trait
Q1 is associated with 9 genes, so 9 null hypotheses are false
and (2196 - 9) null hypotheses are true
FWER = Pr(reject at least one true null hypothesis) = Nf/200
Nf: No. of replicates in which at least one true null hypothesis is rejected
Average Power
Mean of the power over all genes that affect the phenotype (e.g. the 9 genes for Q1)
Evaluating power: Q1, Q2, Disease
Evaluate FWER: Q4
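The two criteria can be computed from per-replicate p-values as sketched below. The data layout (one dict of gene -> p-value per replicate), the function names, and the per-test threshold `alpha` are assumptions; with 2196 tests, a Bonferroni threshold of 0.05/2196 would be one way to target a nominal FWER of 0.05, but the slides do not state which adjustment was used.

```python
def fwer(pvalues, true_null_genes, alpha):
    """N_f / (No. of replicates), where N_f counts the replicates in which
    at least one true null hypothesis is rejected."""
    n_f = sum(1 for rep in pvalues
              if any(rep[g] < alpha for g in true_null_genes))
    return n_f / len(pvalues)

def average_power(pvalues, causal_genes, alpha):
    """Mean over causal genes of the fraction of replicates rejecting that gene."""
    per_gene = [sum(1 for rep in pvalues if rep[g] < alpha) / len(pvalues)
                for g in causal_genes]
    return sum(per_gene) / len(per_gene)

# Toy example: 4 replicates, gene A is causal, genes B and C are null
reps = [{"A": 0.001, "B": 0.5,   "C": 0.2},
        {"A": 0.2,   "B": 0.004, "C": 0.9},
        {"A": 0.003, "B": 0.3,   "C": 0.6},
        {"A": 0.6,   "B": 0.7,   "C": 0.8}]
print(fwer(reps, ["B", "C"], alpha=0.01), average_power(reps, ["A"], alpha=0.01))
```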
Distribution of MAF in the GAW 17 dataset
Figure 1. Distribution of MAF of 24487 SNPs in GAW 17
Group SNPs based on MAFs for CMC
0 - 0.01
0.01 - 0.1
>= 0.1
Similar to Madsen & Browning (2009)
Table 1: Average power
Trait      CMC method    Weighted sum method
Q1         0.144         0.112
Q2         0.00615       0.00308
Disease    0.00444       0.00500
Table 2: FWER (nominal α = 0.05)
Trait      CMC method    Weighted sum method
Q4         0.115         0.0100
• CMC has FWER inflation
• Population stratification or admixture:
samples from Asia, Europe, …
• Relatedness among samples
• Similar results in power and FWER were reported at GAW 17
Variable-Threshold Approach (Price et al 2010)
Given a threshold T, calculate a score for indiv j over the L SNPs with MAF below T
$$ V_j(T) = \sum_{i=1}^{L} I_{ij} $$
I_ij = 0, 1, 2: the count of the minor allele of indiv j at locus i
Calculate the sum of scores for cases (j = 1, ..., N_D):
$$ V(T) = \sum_{j=1}^{N_D} V_j(T) $$
Calculate Z(T) = V(T) / \sqrt{Var(V(T))}
Find T to maximize Z(T), Zmax = max (Z(T))
Permutation to estimate p-value for Zmax
Power: higher than CMC
Extended to quantitative traits
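A rough sketch of the variable-threshold idea for a case-control trait. Several details are assumptions rather than the published method: candidate thresholds are the distinct observed MAFs, a SNP is included when its MAF is at most T, and Z(T) is the case score sum centred and scaled by its mean and variance under random relabelling of the phenotypes. Function names are hypothetical.

```python
import math
import random

def z_max(genotypes, maf, is_case):
    """Max over thresholds T of the standardised case score sum Z(T)."""
    n, n_d = len(genotypes), sum(is_case)
    best = -math.inf
    for t in sorted(set(maf)):
        keep = [i for i, q in enumerate(maf) if q <= t]
        v = [sum(row[i] for i in keep) for row in genotypes]    # V_j(T)
        v_cases = sum(vj for vj, c in zip(v, is_case) if c)     # V(T)
        mean = sum(v) / n
        s2 = sum((vj - mean) ** 2 for vj in v) / n
        var = n_d * (n - n_d) / (n - 1) * s2    # variance under label permutation
        if var > 0:
            best = max(best, (v_cases - n_d * mean) / math.sqrt(var))
    return best

def vt_test(genotypes, maf, is_case, n_perm=1000, seed=0):
    """Permutation p-value for Zmax (one-sided: excess of minor alleles in cases)."""
    rng = random.Random(seed)
    observed = z_max(genotypes, maf, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if z_max(genotypes, maf, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 20 cases then 20 controls; two rare SNPs seen only in cases,
# one common SNP carried by every second individual
geno = [[1 if k < 6 else 0, 1 if 6 <= k < 10 else 0, 1 if k % 2 == 0 else 0]
        for k in range(40)]
status = [1] * 20 + [0] * 20
maf = [sum(row[i] for row in geno) / (2 * len(geno)) for i in range(3)]
print(vt_test(geno, maf, status, n_perm=300, seed=1))
```

Maximising over T is what makes permutation necessary: Zmax no longer follows a standard normal distribution even when each Z(T) does.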
A weighted approach (Price et al 2010)
Calculate a weighted score for indiv j
$$ V_j = \sum_{i=1}^{L} w_i I_{ij} $$
I_ij = 0, 1, 2
Calculate the sum of scores for cases
$$ V = \sum_{j=1}^{N_D} V_j $$
Possible weight
$$ w_i = 1 / \sqrt{q_i (1 - q_i)} $$
Power: similar to the weighted sum method (Madsen & Browning 09)
A weighted approach (Price et al 2010)
Calculate the sum of scores for cases
$$ V = \sum_{j=1}^{N_D} V_j, \qquad V_j = \sum_{i=1}^{L} w_i I_{ij} $$
I_ij = 0, 1, 2
Calculate weights from predicted functional effects
PolyPhen-2 is used to predict damaging effects of missense
mutations with probabilistic scores.
Probabilistic scores as weights may reduce the noise from nonfunctional variants.
Higher power than other methods
A data-adaptive sum test (Han & Pan 2010, Hum
Hered)
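Logistic model
$$ \operatorname{logit} \Pr(y_i = 1) = \beta_0 + \beta_c \sum_{j=1}^{k} x_{ij} $$
x_ij = 0, 1, 2: the count of the minor allele of indiv i at locus j
Effects in opposite directions
$$ \operatorname{logit} \Pr(y_i = 1) = \beta_0 + \beta_j x_{ij} $$
If \beta_j < 0 with p-value < threshold (0.1), recode x_ij
as 2 - x_ij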
Permutation to estimate p-value
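The recoding step can be sketched as follows, assuming each SNP is first tested in its own marginal logistic model fitted by Newton-Raphson and judged by a Wald p-value. The function names, the fixed iteration count, and the rule of leaving a SNP unflipped when the fit is degenerate (monomorphic SNP or complete separation) are illustrative choices, not part of the original method description.

```python
import math

def fit_marginal_logit(x, y, n_iter=25):
    """Fit logit Pr(y = 1) = b0 + b1 * x by Newton-Raphson.
    Returns (b1, two-sided Wald p-value), or (0.0, 1.0) for a degenerate fit."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                  # score for b0
            g1 += (yi - p) * xi           # score for b1
            h00 += w                      # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-10:                   # monomorphic SNP or separation
            return 0.0, 1.0
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    z = b1 / math.sqrt(h00 / det)         # Var(b1) = h00 / det
    return b1, math.erfc(abs(z) / math.sqrt(2))

def adaptive_recode(genotypes, y, threshold=0.1):
    """Flip x_ij to 2 - x_ij for SNPs with a negative marginal effect
    whose p-value is below the threshold. Returns (recoded, flips)."""
    flips = []
    for j in range(len(genotypes[0])):
        b1, p = fit_marginal_logit([row[j] for row in genotypes], y)
        flips.append(b1 < 0 and p < threshold)
    recoded = [[2 - g if f else g for g, f in zip(row, flips)]
               for row in genotypes]
    return recoded, flips
```

The recoded genotypes would then be summed per individual and tested with the common-effect model. Since the recoding uses the phenotypes, it has to be repeated inside every permutation when the p-value is estimated.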
Conclusion
Collapsing methods have higher power than single-marker
test
For genome-wide data analysis, collapsing methods don’t
have much power after multiple-testing adjustment
Weighted sum methods are promising, but need prior
information from biological data
Future research
Modifying the weighted sum method (in progress)
Very high weights to very rare variants
Smoothing weights: w' = 0.5 w + 0.5 \bar{w}, where \bar{w} is the average of all w
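The proposed smoothing is a shrinkage of each weight toward the mean weight. A minimal sketch, where the function name and the `alpha` parameter (0.5 on the slide) are illustrative:

```python
def smooth_weights(weights, alpha=0.5):
    """w' = alpha * w + (1 - alpha) * mean(w): pulls extreme weights
    toward the average while keeping their ordering."""
    mean_w = sum(weights) / len(weights)
    return [alpha * w + (1 - alpha) * mean_w for w in weights]

print(smooth_weights([1.0, 3.0, 8.0]))
```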
Thank you