Using biological networks to search for
interacting loci in genome-wide
association studies
Mathieu Emily et al.
European Journal of Human Genetics,
advance online publication, 11 March 2009
Probable reasons for failure to get significant
understanding from GWAS:
• Environmental effect – genes are assumed to interact with their
environment – many efforts are under way to deal with this.
• Coverage of SNP chips is not perfect – a technological limitation.
• There may be many rare, highly penetrant variants that association
mapping is not designed for – resequencing data can be of help here.
• The detection of variants with low genotypic risk might require
sample sizes larger than what is currently available – increasing the
sample size for replication of high ranking SNPs, as well as
combining different data sets in meta-analyses, will provide insight
into very low odds ratio variants.
• Susceptibility might be caused by the interaction of genetic variants
– epistasis effect, discussed here.
Definition of Epistasis:
Biological: (Defined by Bateson, 1909) Epistasis is described as a masking effect
whereby a variant or allele at one locus prevents the variant at another locus
from manifesting its effect.
Statistical: (Defined by Fisher, 1918) Epistasis is deviation from additivity in the
effect of alleles at different loci with respect to their contribution to a
quantitative phenotype. This definition is closer to the usual concept of
statistical interaction: departure from a specific linear model describing the
relationship between predictive factors – here choice of scale becomes
important, since factors that are additive (measured on one scale) may exhibit
different interaction in a transformed scale.
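The scale dependence can be illustrated with a small sketch. The risks below are invented for illustration (they are not from the paper): two loci whose effects multiply show a non-zero interaction on the raw risk scale, but none on the log scale.

```python
import math

# Hypothetical values (not from the paper): a baseline risk and the
# risk multipliers for carrying a variant at locus A or at locus B.
baseline = 0.01
effect_a = 2.0
effect_b = 3.0

# Risk for each carrier combination (0 = non-carrier, 1 = carrier),
# built so that the two effects are purely multiplicative.
risk = {
    (a, b): baseline * (effect_a ** a) * (effect_b ** b)
    for a in (0, 1)
    for b in (0, 1)
}

def interaction_contrast(table, transform=lambda x: x):
    """Deviation from additivity: f(r11) - f(r10) - f(r01) + f(r00)."""
    f = transform
    return (f(table[(1, 1)]) - f(table[(1, 0)])
            - f(table[(0, 1)]) + f(table[(0, 0)]))

raw_scale = interaction_contrast(risk)            # non-zero on the risk scale
log_scale = interaction_contrast(risk, math.log)  # essentially zero on the log scale
print(f"raw scale: {raw_scale:.4f}, log scale: {log_scale:.2e}")
```

The same penetrance table therefore counts as "epistatic" or "non-epistatic" depending only on the scale chosen for the linear model.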
Detection of Epistasis:
Problems:
Detection of epistasis requires improvements in analysis methods rather than
genotyping technology.
Traditional methods of analysis such as linear and logistic regression have had
limited success due to the sparseness of data in high dimension. Searching
exhaustively for two-locus epistasis using a 500-k chip requires testing of 125
billion SNP pairs – challenging both statistically and computationally.
Statistically, this implies that tests remaining significant after correction for multiple
testing must have P-values lower than about 10^-13. Extrapolating from single variant
findings, such low P-values should be very rare for the sample sizes of existing
studies.
Although it is computationally possible to perform 125 billion tests, these tests
have to be very simple to be run in a reasonable time even on large CPU
clusters.
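The figures above can be checked with a few lines of arithmetic. This is a sketch assuming exactly 500,000 SNPs and a family-wise error rate of 5% (both round numbers chosen here for illustration).

```python
from math import comb

n_snps = 500_000   # assumed size of a "500k" chip
alpha = 0.05       # assumed family-wise error rate

n_pairs = comb(n_snps, 2)                # number of unordered SNP pairs
bonferroni_threshold = alpha / n_pairs   # per-test threshold after Bonferroni

print(f"{n_pairs:,} pairs")                       # 124,999,750,000 pairs
print(f"threshold = {bonferroni_threshold:.1e}")  # threshold = 4.0e-13
```

The threshold is of the order of 10^-13, in line with the text.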
Detection of Epistasis:
Possible Solutions:
Epistatic interaction search should be prioritized by expert knowledge from
biology.
As many epistatic models result in some marginal effect, an obvious
approach is to restrict the search to marker pairs where at least one of the
markers shows a single association. Simulations have shown that this
approach can be powerful, but so far its use on genome-wide real data sets
has not been reported.
A complementary approach is to restrict the search to marker sets that are
expected a priori to interact on the basis of our biological knowledge. For
example, knowledge extracted from protein interaction databases may allow for a
more efficient analysis of genome-wide studies (Pattin and Moore, 2008).
Here, Emily et al. postulated that two genes that biologically interact are good
candidates for a statistical analysis of epistasis in susceptibility to complex
diseases. The search space is reduced to SNPs belonging to gene pairs known to
interact and referenced in protein databases. The approach was tested on WTCCC data.
Methods:
Database used: STRING (combines reported interactions from dedicated
interaction databases and multipurpose databases centered on specific model
organisms). Each protein-protein interaction has a confidence score.
Only interactions with a confidence score > 0.7 were retained, and only autosomal
chromosomes were used.
There are ~71,000 potential protein-protein interactions eligible for testing.
For each relevant protein, the corresponding gene was located using the
Ensembl database, and all SNPs typed in a region of 100 kbp on either side
of the gene were identified, as there may be regulatory variants or SNPs in
significant linkage disequilibrium (LD) with the gene at this distance.
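The SNP selection step can be sketched as a simple window query. The `Gene` and `Snp` types, the function name and all coordinates below are hypothetical, chosen only to illustrate the 100 kbp flanking rule.

```python
from dataclasses import dataclass

WINDOW_BP = 100_000  # 100 kbp on either side of the gene, as in the text

@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int  # gene start, in bp
    end: int    # gene end, in bp

@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int    # position, in bp

def snps_near_gene(gene, snps, window=WINDOW_BP):
    """Return SNPs on the gene's chromosome within `window` bp of the gene."""
    lo = max(0, gene.start - window)
    hi = gene.end + window
    return [s for s in snps if s.chrom == gene.chrom and lo <= s.pos <= hi]

# Made-up gene and SNP coordinates, for illustration only.
gene = Gene("GENE_A", "5", 1_000_000, 1_050_000)
snps = [
    Snp("snp1", "5", 880_000),     # more than 100 kbp upstream: excluded
    Snp("snp2", "5", 950_000),     # inside the upstream flank: kept
    Snp("snp3", "5", 1_149_999),   # inside the downstream flank: kept
    Snp("snp4", "15", 1_000_000),  # other chromosome: excluded
]
selected = snps_near_gene(gene, snps)
print([s.rsid for s in selected])  # ['snp2', 'snp3']
```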
Methods: (contd.)
Application: on WTCCC data for 7 diseases, each having 2000 cases and
3000 controls.
Filtering of WTCCC data: Genotypes called as missing (posterior
probability < 0.95) were removed. Markers were removed if the % of missing data
was > 1%, if the MAF was < 10% [not dealing with rarer ones], or if they were not
in Hardy-Weinberg equilibrium (HWE, p < 0.05).
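A minimal sketch of the marker filter, using the three thresholds listed above. The HWE test used here (a 1-df chi-square on genotype counts) is an assumption made for illustration; the slides do not say which HWE test was used or in which samples it was applied. Function names and genotype counts are hypothetical.

```python
import math

def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(n_aa, n_ab, n_bb, n_missing,
              max_missing=0.01, min_maf=0.10, min_hwe_p=0.05):
    """Apply the missing-data, MAF and HWE thresholds to one marker."""
    n_called = n_aa + n_ab + n_bb
    missing_rate = n_missing / (n_called + n_missing)
    p = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(p, 1 - p)
    return (missing_rate <= max_missing
            and maf >= min_maf
            and hwe_p_value(n_aa, n_ab, n_bb) >= min_hwe_p)

# Hypothetical genotype counts, for illustration only.
print(passes_qc(90, 420, 490, n_missing=5))    # common marker in HWE
print(passes_qc(1, 58, 941, n_missing=0))      # MAF of 3%: rejected
print(passes_qc(300, 100, 600, n_missing=0))   # strong HWE deviation: rejected
```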
To remove statistical interaction caused by LD rather than disease
association, they excluded all SNP pairs located in linked gene pairs (genes
on the same chromosome and separated by less than 5Mbp).
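The exclusion rule for linked gene pairs can be sketched as follows. How the distance between two genes was measured is not stated here, so this sketch assumes one representative coordinate per gene; the function name and positions are hypothetical.

```python
MIN_SEPARATION_BP = 5_000_000  # 5 Mbp, as in the text

def is_linked(chrom_a, pos_a, chrom_b, pos_b, min_sep=MIN_SEPARATION_BP):
    """True if two genes lie on the same chromosome less than `min_sep` bp apart.

    `pos_a` and `pos_b` are single representative coordinates per gene
    (an assumption of this sketch).
    """
    return chrom_a == chrom_b and abs(pos_a - pos_b) < min_sep

# Made-up positions: only unlinked gene pairs are kept for testing.
print(is_linked("5", 10_000_000, "5", 12_000_000))   # True: same chromosome, 2 Mbp apart
print(is_linked("5", 10_000_000, "5", 20_000_000))   # False: 10 Mbp apart
print(is_linked("5", 10_000_000, "15", 10_000_000))  # False: different chromosomes
```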
After filtering, the number of SNP pairs for each disease was between
3,107,904 and 3,850,339.
Correction for multiple testing was performed using a Bonferroni-type
correction with the number of effective SNP pairs in a gene pair.
Quantile-quantile plots were constructed by plotting the order statistics from a
set of values against their expected values obtained from the theoretical
distribution under the null hypothesis.
A 95% concentration band was calculated with 10,000 simulations, assuming SNP
pairs are independent.
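A concentration band of this kind can be sketched by simulation. The sketch below assumes the null distribution is a chi-square with 4 degrees of freedom (the distribution reported for the interaction statistic in the Results) and estimates the expected order statistics from the simulations themselves rather than from theoretical quantiles. The function name is hypothetical, and the demo uses fewer simulations than the 10,000 mentioned above to keep it quick.

```python
import random

def simulate_qq_band(n_points, n_sim=10_000, level=0.95, seed=0):
    """Simulate order statistics of independent chi-square(4 df) values.

    Returns, for each rank, the mean order statistic and the lower and
    upper limits of a pointwise concentration band at the given level.
    """
    rng = random.Random(seed)
    # A chi-square with 4 df is a gamma with shape 2 and scale 2.
    sims = [sorted(rng.gammavariate(2.0, 2.0) for _ in range(n_points))
            for _ in range(n_sim)]
    lo_idx = int((1 - level) / 2 * n_sim)
    hi_idx = int((1 + level) / 2 * n_sim) - 1
    expected, lower, upper = [], [], []
    for rank in range(n_points):
        column = sorted(sim[rank] for sim in sims)
        expected.append(sum(column) / n_sim)
        lower.append(column[lo_idx])
        upper.append(column[hi_idx])
    return expected, lower, upper

# Small demo: 100 independent statistics, 2,000 simulations.
expected, lower, upper = simulate_qq_band(n_points=100, n_sim=2000)
print(f"largest statistic: expected {expected[-1]:.1f}, "
      f"band [{lower[-1]:.1f}, {upper[-1]:.1f}]")
```

Observed statistics, once sorted, would be plotted against `expected`; points outside `[lower, upper]` depart from what independence under the null would produce.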
Results:
Powerful statistical procedure: For 10,000 random SNP pairs, expected to
follow the null hypothesis of non-interaction, the interaction statistic,
calculated as a likelihood ratio, followed a chi-square distribution
with 4 degrees of freedom: the type I error rates (false positives) at the 1, 5,
10 and 20% levels were 0.94, 4.8, 10 and 20%, respectively.
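Converting a likelihood ratio statistic into a P-value under a chi-square with 4 degrees of freedom needs no special library, because the tail probability has a closed form for even degrees of freedom. The sketch below assumes the statistic is twice the difference in log-likelihood between a model with interaction terms and one without; the log-likelihood values are hypothetical.

```python
import math

def lrt_statistic(loglik_full, loglik_reduced):
    """Likelihood ratio statistic: 2 * (log L_full - log L_reduced)."""
    return 2.0 * (loglik_full - loglik_reduced)

def chi2_sf_4df(x):
    """P(X > x) for a chi-square with 4 df: exp(-x/2) * (1 + x/2)."""
    if x <= 0:
        return 1.0
    return math.exp(-x / 2) * (1 + x / 2)

# Hypothetical log-likelihoods, for illustration only.
stat = lrt_statistic(loglik_full=-3310.2, loglik_reduced=-3316.0)
print(f"LR statistic = {stat:.2f}, P = {chi2_sf_4df(stat):.4f}")

# Sanity check: the 5% critical value of a chi-square with 4 df is about 9.488.
print(f"P at 9.488 = {chi2_sf_4df(9.488):.3f}")
```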
Effective comparison method over Bonferroni correction: The type I error
rate at the 5% level shows that a Bonferroni correction is overly
conservative: the probability of rejecting the null hypothesis of non-interaction
was 0.8%. Using the effective number of pairs gave a better correction:
the probability of rejecting the null hypothesis at the 5% level was 4.5%.
This study shows that LD structure within genes induces dependency
between SNP pairs, lowering the power to detect epistasis.
From this simulation study and the analysis of case-control data of WTCCC,
they concluded that the number of effective tests is approximately six times
lower than the total number of tests.
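The practical effect of the sixfold reduction can be sketched as below. Applying a single factor of 6 to the whole data set is a simplification made here for illustration; the correction described in the Methods works with the effective number of SNP pairs within each gene pair. The function name is hypothetical.

```python
def corrected_threshold(n_pairs, alpha=0.05, reduction=1.0):
    """Per-test threshold alpha / n_eff, where n_eff = n_pairs / reduction."""
    n_effective = n_pairs / reduction
    return alpha / n_effective

n_pairs = 3_107_904  # smallest number of SNP pairs tested for a disease

plain = corrected_threshold(n_pairs)                     # Bonferroni on all pairs
effective = corrected_threshold(n_pairs, reduction=6.0)  # ~6x fewer effective tests

print(f"Bonferroni: {plain:.2e}")
print(f"Effective:  {effective:.2e}")
```

The threshold becomes six times less stringent, which is where the gain over a plain Bonferroni correction comes from.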
Overall results from WTCCC analysis:
Total SNP pairs tested per disease: 3,107,904 - 3,850,339
Approx. two-thirds of SNP pairs were removed by the quality filter from the total
of 10,700,176 selected SNP pairs.
The analysis of one data set took 130-160 h, corresponding to an average of
25,000 tests per hour, on a typical computer.
In comparison, testing all possible pairs (125 billion pairs) would take 570
years on a single computer!
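The runtime extrapolation follows directly from the reported throughput, as this short check shows (assuming the same 25,000 tests per hour and a 365-day year).

```python
tests_per_hour = 25_000    # average throughput reported above
all_pairs = 125e9          # all SNP pairs on a 500k chip

hours = all_pairs / tests_per_hour   # 5,000,000 hours
years = hours / (24 * 365)           # roughly 570 years

print(f"{hours:,.0f} hours, about {years:.1f} years")
```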
From the STRING database, ~71,000 protein-protein interactions were used.
Significant at 5% level
The marginal effect of an allele is a statistical measure of the impact of that
particular allele averaged over the phenotypes of all individuals who have a
genotype that includes that allele (Sing and Davignon 1985; Sing et al 1996).
Relative risk = affected frequency/non-affected frequency.
Crohn’s disease:
Significant interaction
for 8 SNP pairs – 2 SNPs
from APC region and 4
SNPs from IQGAP1
region.
APC – Adenomatous
Polyposis Coli (rs434157
on 5q22).
IQGAP1 – IQ-domain
GTPase-activating
protein 1 (rs6496669 on
15q26).
Crohn’s disease:
APC – Adenomatous Polyposis Coli (rs434157 on 5q22), MAF: 0.33
IQGAP1 – IQ-domain GTPase-activating protein 1 (rs6496669 on 15q26), MAF: 0.20
The joint OR that combines the three at-risk genotypes (AG,GG), (AA,GA) and (AA,GG) is 1.85 (95%
CI: 1.45-2.37) and is significantly larger than 1 (Fisher’s exact test, P = 8.88 x 10^-7).
That means carrying at least three minor alleles across rs6496669 and rs434157
elevates the risk for CD in the WTCCC data.
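A joint odds ratio and its confidence interval can be computed from a 2x2 table of carriers and non-carriers of the at-risk genotypes. The counts below are hypothetical (the slide does not give the underlying table), and the interval uses the standard log-OR (Woolf) approximation, which is an assumption: the slide does not state how the CI was obtained.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI (Woolf's log method).

    a = cases with an at-risk genotype,    b = cases without
    c = controls with an at-risk genotype, d = controls without
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, for illustration only (not the WTCCC table).
or_, lo, hi = odds_ratio_ci(a=150, b=1850, c=130, d=2870)
print(f"OR = {or_:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

An interval whose lower limit stays above 1 indicates elevated risk for carriers of the at-risk genotypes.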
Discussion:
New method: (1) by focusing on potentially good SNP pair candidates, which take
part in a protein-protein interaction network, this method increases the significance
level, and true findings, missed by testing all pairs exhaustively, may be picked up
by this method.
(2) By accounting for the correlation between SNP pairs, this method controls
for multiple comparisons more efficiently than a Bonferroni correction.
(3) The proposed statistical procedure can detect a large variety of epistatic models
and allow for the detection of interaction between loci that do not display marginal
effects.
Four potential interactions were found, for CD, BD, HT and RA, but no significant
interaction for CAD, T1D & T2D.
- This might be due to the restriction to intragenic regions: either such interactions
are rare, or the statistical power is limited by the present sample sizes.
This approach is most powerful for identifying interaction of common SNPs
with very limited marginal effects, which are exactly the types of interaction
missed by other approaches based on marginal effects.
For a pair of interacting variants where neither variant is on the chip, the use
of tag-SNPs to detect the true interaction may fail, leading to a dramatic loss
of power even in data sets with thousands of individuals.
A similar statistical procedure can be designed to detect higher-order
interactions. However, such a test is most likely to be limited in terms of
power, because of the higher number of degrees of freedom and because one expects
very low counts for any given n-tuple of SNPs.