Integrating domain knowledge with statistical
and data mining methods for high-density
genomic SNP disease association analysis
Dinu et al, J. Biomedical Informatics 40 (2007) 750-760
Pathway/SNP
- A software application that lets its user apply pathway data to
the analysis of high-density genomic SNP data derived from disease
association studies.
- The purpose is to analyze the underlying etiology of disease
through the integration of pathway information using statistical
and data mining approaches.
Background:
- Large-scale genome-wide association (GWA) studies are now
available to identify genomic mutations associated with a wide
range of diseases.
- Complex diseases, such as diabetes and hypertension, are believed
to be caused by the interaction of multiple genes and
environmental factors.
- The number of mathematical operations required to assess the
association between multiple interacting genomic loci and disease
grows exponentially with the number of interacting SNPs.
- Various statistical approaches, such as stepwise algorithms,
varying parameters, etc., are used to analyze these associations.
- Data mining approaches are used to test multi-locus association
with traits.
Computational complexity for brute-force ‘full-scan’ interaction
analysis between all possible combinations of m genomic markers
and a disease is exponential in m.
For the Affymetrix 100K SNP GeneChip, m = 100,000 genomic markers.
A full scan requires:
# of interacting markers    # of tests
2                           5.00 x 10^9
3                           1.66 x 10^14
4                           4.16 x 10^18
5                           8.33 x 10^22
The fastest supercomputer can perform ~3.67 x 10^14 flops/s.
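The test counts in the table are binomial coefficients: the number of ways to choose k markers out of m. The sketch below recomputes them and turns them into a rough lower bound on run time. It assumes, purely for illustration, that each test costs a single floating-point operation; real association tests cost far more. The names `full_scan_tests`, `FLOPS` and `SECONDS_PER_YEAR` are hypothetical helpers, not part of Pathway/SNP.

```python
from math import comb

M = 100_000                      # markers on the 100K GeneChip (from the text)
FLOPS = 3.67e14                  # quoted supercomputer speed, in flops/s
SECONDS_PER_YEAR = 365 * 24 * 3600


def full_scan_tests(m: int, k: int) -> int:
    """Number of distinct k-marker combinations among m markers."""
    return comb(m, k)


for k in range(2, 6):
    tests = full_scan_tests(M, k)
    # Assumption: one floating-point operation per test (a lower bound).
    years = tests / FLOPS / SECONDS_PER_YEAR
    print(f"{k}-way: {tests:.2e} tests, at least {years:.2e} years")
```

Even under this optimistic assumption, the 5-way scan takes years on the quoted machine, which is the motivation for restricting the search to pathway-relevant SNPs.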
Conclusion:
- “One model fits all” approach is not optimal.
Pathway/SNP
– Designed as an exploratory tool which integrates pathway
information, gene annotation, and SNP location to identify the
pathways that are most strongly associated with disease.
Architecture: 3-tier architecture written in Java
1> Presentation tier – written in Java Server Pages
2> Logic tier – statistical and data mining algorithms in Java
3> Data tier – genotype, phenotype and annotation data stored in
a heavily indexed relational database.
Biological Data
- Annotations for 561 pathways –
181 KEGG, 314 BioCarta and 66 GenMAPP human pathways.
- Gene annotation data – from NCBI Entrez Gene
- Affymetrix 100k and 500k GeneChip microarray annotation files
are preloaded in the database.
Relevant SNPs:
SNPs located within 10,000 base pairs (bp) of a gene in a given
biological pathway are considered relevant to that pathway.
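The 10,000 bp rule can be sketched as a simple interval check. This is a minimal illustration, assuming each gene is described by a chromosome plus start and end coordinates and that the window extends 10,000 bp on both sides of the gene. The `Gene` and `Snp` types, the `relevant_snps` function and all coordinates are hypothetical; the tool itself stores this data in a relational database.

```python
from typing import NamedTuple

FLANK_BP = 10_000  # window size stated in the text


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int


def relevant_snps(genes, snps, flank=FLANK_BP):
    """Return IDs of SNPs lying within `flank` bp of any pathway gene."""
    hits = []
    for snp in snps:
        for g in genes:
            same_chrom = snp.chrom == g.chrom
            if same_chrom and g.start - flank <= snp.pos <= g.end + flank:
                hits.append(snp.snp_id)
                break  # one matching gene is enough
    return hits


# Hypothetical coordinates, for illustration only.
genes = [Gene("GENE_A", "1", 100_000, 150_000)]
snps = [
    Snp("snp1", "1", 95_000),    # 5,000 bp upstream: relevant
    Snp("snp2", "1", 161_000),   # 11,000 bp downstream: not relevant
    Snp("snp3", "2", 120_000),   # different chromosome: not relevant
]
print(relevant_snps(genes, snps))
```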
Relevant Genes:
An initial gene list is extracted from a particular database, then
augmented from the literature and Entrez Gene.
Algorithms:
1> Single SNP association with disease
- Chi square and Armitage’s trend test
2> Pathway association with disease
- U-statistics or data mining algorithms
3> Permutation-based statistical significance inference
- Bonferroni adjustment or False discovery rate (FDR)
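Permutation-based inference, listed as step 3 above, can be sketched generically: shuffle the case/control labels many times and count how often the shuffled statistic is at least as large as the observed one. This is a minimal sketch, not the tool's implementation. The function names, the add-one correction and the toy statistic `mean_dosage_diff` are assumptions made for illustration.

```python
import random


def mean_dosage_diff(genotypes, labels):
    """Toy statistic: mean risk-allele count in cases minus controls."""
    cases = [g for g, l in zip(genotypes, labels) if l == 1]
    controls = [g for g, l in zip(genotypes, labels) if l == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_pvalue(statistic, genotypes, labels, n_perm=1000, seed=0):
    """One-sided permutation p-value for `statistic` under label shuffling."""
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype/phenotype link
        if statistic(genotypes, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical single-SNP data: risk-allele counts and case (1) / control (0).
genotypes = [2] * 10 + [0] * 10
labels = [1] * 10 + [0] * 10
print(permutation_pvalue(mean_dosage_diff, genotypes, labels))
```

Because shuffling only reassigns labels, the numbers of cases and controls stay fixed in every permutation.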
Single SNP association with disease:
1> Chi square test
2> Armitage’s trend test: 1 degree of freedom, more preferred
Chi square test, allele-based:
            Allele A count    Allele B count
Case
Control
Chi square test, genotype-based: 2 degrees of freedom
            AA count    AB count    BB count
Case
Control
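Both chi-square variants apply the same Pearson statistic to a different contingency table. The following sketch computes the statistic for any case/control table and collapses genotype counts into allele counts. It is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, the counts are made up, and p-values are only handled for 1 and 2 degrees of freedom, where closed forms exist.

```python
from math import erfc, exp, sqrt


def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # must be non-zero
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_pvalue(stat, df):
    """Upper-tail p-value using the closed forms for df = 1 and df = 2."""
    if df == 1:
        return erfc(sqrt(stat / 2))
    if df == 2:
        return exp(-stat / 2)
    raise ValueError("only df = 1 or 2 is handled in this sketch")


def allele_table(genotype_table):
    """Collapse [AA, AB, BB] counts per group into [A, B] allele counts."""
    return [[2 * aa + ab, ab + 2 * bb] for aa, ab, bb in genotype_table]


# Hypothetical genotype counts: rows are case and control, columns AA, AB, BB.
genotypes = [[10, 40, 46], [25, 20, 5]]
for name, table in (("genotype", genotypes), ("allele", allele_table(genotypes))):
    stat, df = pearson_chi2(table)
    print(name, df, round(stat, 2), chi2_pvalue(stat, df))
```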
Armitage’s Trend Test
This test checks whether case vs. control status follows a ‘trend’
across genotypes, under different models of association between a
SNP and disease.
Additive model: tests an association that depends additively on the
number of risk (or minor) alleles: 0 for the homozygous non-risk
genotype, 1 for the heterozygous genotype and 2 for the homozygous
risk genotype.
Dominant model: tests the association of carrying at least one risk
allele, i.e. homozygous risk (1) or heterozygous (1), vs. carrying
none, i.e. homozygous non-risk (0).
Recessive model: tests the association of being homozygous for the
risk allele (1) vs. carrying at least one non-risk allele, i.e.
homozygous non-risk (0) or heterozygous (0).
Armitage’s Trend Test statistic has 1 degree of freedom.
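The three models differ only in the weights given to the genotypes. The sketch below uses the standard Cochran-Armitage form of the trend statistic with those weights. It assumes genotype counts are ordered by number of risk alleles (0, 1, 2). The names `WEIGHTS` and `trend_test` and the example counts are hypothetical.

```python
from math import erfc, sqrt

# Genotype weights for 0, 1 and 2 copies of the risk allele.
WEIGHTS = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}


def trend_test(cases, controls, model="additive"):
    """Cochran-Armitage trend chi-square (1 df) and its p-value.

    cases, controls: genotype counts ordered by risk-allele count (0, 1, 2).
    """
    w = WEIGHTS[model]
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_wr = sum(wi * ri for wi, ri in zip(w, cases))
    sum_wn = sum(wi * ni for wi, ni in zip(w, totals))
    sum_w2n = sum(wi * wi * ni for wi, ni in zip(w, totals))
    t = n * sum_wr - n_cases * sum_wn
    var = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2) / n
    chi2 = t * t / var
    return chi2, erfc(sqrt(chi2 / 2))  # upper tail of chi-square, 1 df


# Hypothetical counts for non-risk homozygote, heterozygote, risk homozygote.
cases, controls = [10, 20, 20], [30, 15, 5]
for model in WEIGHTS:
    print(model, trend_test(cases, controls, model))
```

With the dominant or recessive weights the statistic reduces to the Pearson chi-square on the corresponding collapsed 2x2 table, which is a convenient sanity check.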
U-statistics for pathway association with disease:
- Non-parametric algorithm that can simultaneously test the
association of multiple markers with disease, with only a single
degree of freedom.
- First measures a score over all markers (the set of SNPs) for
pairs of subjects within each of the case and control groups. The
genetic score for a pair of subjects is measured by a “kernel”
function, such as recessive, dominant or linear dosage.
- Then compares the average scores between cases and controls by
use of a global statistic with one degree of freedom instead of the
implicit many degrees of freedom when many markers are
analyzed.
- The resulting z-scores can be used to rank pathways and also to
calculate an approximate p-value.
Consider b as risk allele and a as non-risk allele
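The pairwise scoring idea can be illustrated with a heavily simplified sketch. Genotypes are coded as the number of copies of the risk allele b (0, 1 or 2). The kernel definitions below are illustrative assumptions, since the kernel table itself is not reproduced in this transcript. The sketch only returns the raw difference in average pair scores between cases and controls; the actual method goes on to standardize this into a z-score with one degree of freedom, which is not shown here.

```python
from itertools import combinations

# Illustrative kernels on risk-allele counts; assumed, not taken from the paper.
KERNELS = {
    "dominant": lambda gi, gj: float(gi >= 1 and gj >= 1),
    "recessive": lambda gi, gj: float(gi == 2 and gj == 2),
    "linear": lambda gi, gj: float(gi * gj),
}


def mean_pair_score(group, kernel):
    """Average kernel score over all subject pairs, summed over markers."""
    pairs = list(combinations(group, 2))  # needs at least two subjects
    total = sum(kernel(a[k], b[k]) for a, b in pairs for k in range(len(a)))
    return total / len(pairs)


def u_difference(cases, controls, kernel_name="linear"):
    """Raw difference in average pair scores, cases minus controls."""
    kernel = KERNELS[kernel_name]
    return mean_pair_score(cases, kernel) - mean_pair_score(controls, kernel)


# Hypothetical subjects: one tuple of risk-allele counts per subject.
cases = [(2, 1), (2, 2), (1, 2)]
controls = [(0, 0), (1, 0), (0, 1)]
for name in KERNELS:
    print(name, u_difference(cases, controls, name))
```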
Data mining for pathway association with disease:
- Data mining classifiers (e.g., SVM, Random Forests, logistic,
tree-based) can be used to explore the association between
pathways and disease.
- The “percent correct” classification of cases and controls
estimated with the genotypes at the pathway SNPs can be used as a
statistic for measuring the association between pathways and
disease.
- Incorporated through the Weka data mining program; classifiers
are run by default with 10-fold cross-validation.
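The "percent correct" statistic can be illustrated without Weka. The sketch below swaps in a plain 1-nearest-neighbour classifier as a stand-in for the Weka classifiers named above, and uses a simple unshuffled, unstratified fold assignment. Both choices are simplifying assumptions, as are the function names and the toy data.

```python
def one_nn_predict(train_x, train_y, x):
    """Label of the training subject whose genotype vector is closest to x."""
    dists = [sum(abs(a - b) for a, b in zip(t, x)) for t in train_x]
    return train_y[dists.index(min(dists))]


def percent_correct_cv(X, y, k=10):
    """k-fold cross-validated accuracy, in percent, of the stand-in classifier."""
    n = len(X)
    correct = 0
    for fold in range(k):
        # Subject i belongs to fold i % k (simplified fold assignment).
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        train_x = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if one_nn_predict(train_x, train_y, X[i]) == y[i]:
                correct += 1
    return 100.0 * correct / n


# Hypothetical pathway genotypes (risk-allele counts at two SNPs) and labels.
X = [(2, 2)] * 10 + [(0, 0)] * 10
y = [1] * 10 + [0] * 10
print(percent_correct_cv(X, y))
```

A pathway whose SNPs carry no signal should score close to the majority-class rate, so the percentage is only meaningful relative to that baseline or to a permutation null.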
Multiple testing corrections:
- A good test statistic value may have occurred by chance alone,
especially when many tests are performed. Multiple testing
corrections are designed to help guard against this.
Bonferroni adjustments:
- The Bonferroni adjustment multiplies each individual p-value by
the number of times that same test was performed (the number of
markers tested).
- This value, which is quite conservative, seeks to estimate the
probability that this test would have come out this well by chance
at least once from all of the times this test was performed.
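The adjustment is a one-line computation. In this minimal sketch the adjusted value is capped at 1 so that it remains a valid probability; the function name and the example p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p-values from three single-SNP tests.
print(bonferroni([0.01, 0.04, 0.5]))
```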
Statistical significance using permutation based FDR:
- The False Discovery Rate (FDR) option calculates the False
Discovery Rate for each statistical test selected. This is a test
which is itself based upon the p-values from the original tests.
- The interpretation of the False Discovery Rate is “What would
the rate of false discoveries (false positives) be if I accepted ALL
of the tests whose p-value is at or below the p-value of this test?”
- The aim of the FDR procedure is to control, at a desired level α
(e.g., 0.05), the proportion of type I errors (false positives) among
all significant results.
- Suppose m hypotheses are tested, and R of them are rejected
(positive results). Of the rejected hypotheses, suppose that V of
them are really null, that is, V is the number of type I errors, or
false positive results. The False Discovery Rate is defined as
FDR = E(V/R | R > 0) · P(R > 0),
that is, the expected proportion of false positive findings among all
rejected hypotheses times the probability of making at least one
rejection.
- This procedure may yield higher statistical power than controlling
the family-wise error rate. Pathways with low FDR (e.g., below 0.05)
are considered significant.
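One common way to turn a list of p-values into FDR values matching the interpretation above is the Benjamini-Hochberg step-up adjustment. The sketch below implements that procedure as a stand-in; it is not the permutation-based FDR estimate used by the tool. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical pathway p-values; flag those with adjusted value below 0.05.
pvals = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(pvals)
print([(p, q, q < 0.05) for p, q in zip(pvals, fdr)])
```

Each adjusted value estimates the false discovery rate incurred by accepting every test whose p-value is at or below that of the given test.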
Using Pathway/SNP to analyze AMD data set:
- This data set contains 116,204 genome-wide SNPs genotyped with the Affymetrix
100K GeneChip
- Case-control study of 146 Caucasian individuals
- 50 controls and 96 cases with advanced AMD
- 50 patients with wet AMD (severe) and 46 patients with dry AMD.
- Initial analysis identifies a mutation in complement factor H (CFH) on
chromosome 1 as strongly associated with AMD.
- Identified 46 genes (from KEGG & NCBI genome build 35)
- A total of 94 SNPs are relevant (within 10,000 bp).
- Armitage’s trend test with additive model and U-statistics with 5 kernels
(dominant, recessive, linear, quadratic, allele match) and 4 data-mining
algorithms (J48, Random Forests, SVM, Naïve Bayes) were performed.
- Subjects were compared in 4 groupings: control vs. all cases (wet+dry),
control vs. wet AMD, control vs. dry AMD, dry AMD vs. wet AMD.
- Identified two additional pathway genes, C7 and MBL2:
- Explanation of the difference between progressing to dry AMD
(the less severe form) and to wet AMD (the more severe form)
Lessons learned:
- The potential need for high performance computation to support
a tool like Pathway/SNP
- The need for permutation testing to evaluate the results of the
analysis
- Dealing with different versions of the biological data and
knowledge
- Why different analysis algorithms might work better with
different data sets and different diseases
- The complexity of the “clinical phenotype”