Disease-Associated Multi-SNP Combinations Search


COMBINATORIAL SEARCH METHODS FOR MULTI-SNP
DISEASE ASSOCIATION
Dumitru Brinza, Jingwu He and Alexander Zelikovsky
Human Genome and SNP
Length of human genome ≈ 3 × 10^9 base pairs
Difference between any two people ≈ 0.1% of genome
Total number of single nucleotide polymorphisms (SNPs) ≈ 3 × 10^6
SNP - single nucleotide site where two or more different
nucleotides occur in a large percentage of population
0 = wild type/major (frequency) allele
1 = mutation/minor (frequency) allele
International HapMap project:
SNP maps are constructed across the human
genome with density of about one SNP per
thousand nucleotides.
HapMap tries to identify 1 million tag SNPs
providing almost as much mapping information as the entire 10 million SNPs
Unfortunately, not as much is known about SNP
combinations
HapMap's initial budget was 100 million dollars
To date, around 1.5 million SNPs have been typed
Most of the data are trios
Analysis of variation in suspected genes in disease and
nondisease individuals is aimed at identifying SNPs with
considerably higher frequencies among the disease
individuals than among the nondisease individuals
Most searches are done on a SNP-by-SNP basis
Recently two-SNP analysis shows promising results
(Marchini et al, 2005)
Multi-SNP analyses are expected to find even stronger
disease associations
Common diseases can be caused by combinations of
variations in several unlinked genes (SNPs)
We address the computational challenge of searching for
such multi-gene causal combinations
Unadjusted p-value: probability of the case/control distribution in
the set defined by an MSC, computed from the binomial distribution
Multiple-testing-adjusted p-value: randomization
Randomly permute the disease status of the population to generate 1000 instances.
Apply the searching methods to each instance to get MSCs.
Compute the fraction of instances whose MSCs have an unadjusted p-value at least as significant as
(no larger than) the observed p-value.
In our search we report only MSCs with adjusted p-value < 0.05
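The two p-values above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes the unadjusted p-value is the upper binomial tail with success probability equal to the overall disease fraction, and it uses a single-SNP search as a stand-in for the real searching method. All function names are hypothetical.

```python
import math
import random

def binomial_tail(k, n, q):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(math.comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k, n + 1))

def unadjusted_p(cases_in_set, controls_in_set, total_cases, total_controls):
    # Assumption: under the null, each individual in the MSC-defined set is a
    # case with probability q = overall fraction of cases in the population.
    q = total_cases / (total_cases + total_controls)
    return binomial_tail(cases_in_set, cases_in_set + controls_in_set, q)

def best_single_snp_p(genotypes, status):
    """Stand-in search: smallest unadjusted p-value over all one-SNP MSCs."""
    total_cases = sum(status)
    total_controls = len(status) - total_cases
    best = 1.0
    for j in range(len(genotypes[0])):
        for v in (1, 2):  # assumed MSC values (non-wild-type genotypes)
            cases = sum(1 for g, s in zip(genotypes, status) if g[j] == v and s)
            controls = sum(1 for g, s in zip(genotypes, status) if g[j] == v and not s)
            if cases + controls:
                p = unadjusted_p(cases, controls, total_cases, total_controls)
                best = min(best, p)
    return best

def adjusted_p(genotypes, status, observed_p, search, n_perm=1000, seed=0):
    """Fraction of permuted instances whose best p-value is <= observed_p."""
    rng = random.Random(seed)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute disease status, keep genotypes fixed
        if search(genotypes, shuffled) <= observed_p:
            hits += 1
    return hits / n_perm
```

Passing the search as a parameter means the same permutation loop can wrap any of the searching methods, so the adjustment accounts for however many combinations that method actually tests.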
Disease association analysis searches for SNPs
or multi-SNP combinations with frequency
among disease individuals considerably higher
than among nondisease individuals.
Affymetrix GeneChip for gene genotyping (500k microarray chip)
Genetic epidemiology
Searching for genetic risk factors for diseases
Monogenic diseases
A mutated gene is entirely responsible for the disease
Typically rare in population: < 0.1%
Practically all cases are already reported
[Figure: example genotype matrix with SNP values in {0, 1, 2} and the disease status (sick/healthy) of each individual]
Bonferroni is too crude (e.g., 3-SNP combinations among 100 SNPs, p < 0.05 × 10^-6)
We adjust the resulting p-values via randomization
The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs).
How to find associated multi-SNP combinations without checking all of them?
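The counts behind these statements can be checked with a short calculation. The sketch below assumes that each SNP in an MSC is either unconstrained ("x") or fixed to genotype value 1 or 2, which is inferred from the 3^100 total and the example pattern rather than stated explicitly. The function names are hypothetical.

```python
import math

def msc_count(m, k, values_per_snp=2):
    """Number of MSCs that fix exactly k of m SNPs to one of the allowed values."""
    return math.comb(m, k) * values_per_snp**k

def total_msc_count(m, values_per_snp=2):
    """All MSCs over m SNPs: each SNP is 'x' or one of the allowed values."""
    return (values_per_snp + 1)**m

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

three_snp = msc_count(100, 3)                      # 1293600 combinations
threshold = bonferroni_threshold(0.05, three_snp)  # about 3.9e-8
```

Under this assumption the Bonferroni threshold for 3-SNP combinations among 100 SNPs is of the same order as the 0.05 × 10^-6 quoted above, and summing `msc_count(m, k)` over all k gives `total_msc_count(m)`, which includes the empty combination.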
High-throughput genotyping technology
Our contributions
A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
Multi-SNP combinations significantly associated with diseases
were found.
MSC
x x 1 x x 2 x x x
4 sick : 1 healthy
For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations
with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of
SNPs showed significant association.
For a dataset for an autoimmune disorder (Ueda et al., 2003), a few previously
unknown associated multi-SNP combinations were found.
For tick-borne encephalitis virus-induced disease, a multi-SNP combination within a
group of genes showing a high degree of linkage disequilibrium significantly associated
with the severity of the disease was found.
check significance
Statistical significance
Complex diseases
Affected by the interaction of multiple genes
Significance of a risk factor is usually measured by Risk Rate or
Odds Ratio
We measure significance by the p-value of the set of genotypes
defined by the risk factor
Proposed searching methods
Disease association analysis
If the reported SNP is found among 100 SNPs then the
probability that the SNP is associated with a disease by
mere chance becomes 100 times larger (Bonferroni).
Exhaustive Search (ES):
In order to find a multi-SNP combination with the p-value of the
frequency distribution below 0.05, it checks all one-SNP, two-SNP, ..., m-SNP combinations.
Runtime is O(n 3^m), making a complete search infeasible even for small numbers of SNPs m
We restrict searching to 1, 2, 3, 4, or 5 SNPs
Searching level – number of SNPs which participate in MSC
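The exhaustive search restricted to a maximum searching level can be sketched as follows. This is an illustrative enumeration under stated assumptions, not the authors' code: MSC values are assumed to be genotype values 1 or 2, an MSC is represented as a tuple of (SNP index, value) pairs, and the significance test is left to the caller. The names `matches`, `exhaustive_search`, and the toy data are hypothetical.

```python
from itertools import combinations, product

def matches(genotype, msc):
    """True if the genotype carries the required value at every SNP of the MSC."""
    return all(genotype[j] == v for j, v in msc)

def exhaustive_search(genotypes, status, max_level, values=(1, 2)):
    """Yield (msc, cases, controls) for every MSC with 1..max_level SNPs
    that is carried by at least one individual."""
    m = len(genotypes[0])
    for level in range(1, max_level + 1):
        for snps in combinations(range(m), level):
            for vals in product(values, repeat=level):
                msc = tuple(zip(snps, vals))
                cases = controls = 0
                for g, s in zip(genotypes, status):
                    if matches(g, msc):
                        if s:
                            cases += 1
                        else:
                            controls += 1
                if cases + controls:
                    yield msc, cases, controls

# Toy data: 4 individuals, 3 SNPs; True marks a disease individual.
example_genotypes = [(1, 2, 0), (1, 2, 1), (1, 0, 2), (0, 2, 0)]
example_status = [True, True, False, False]

found = {msc: (c, h) for msc, c, h
         in exhaustive_search(example_genotypes, example_status, max_level=2)}
```

In the toy data the one-SNP combination `((0, 1),)` covers 2 disease and 1 nondisease individuals, while the two-SNP combination `((0, 1), (1, 2))` covers the same 2 disease individuals and no nondisease ones, which is the kind of sharper association a higher searching level can reveal.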
A multi-SNP combination (MSC) defines a set of disease and
nondisease individuals
An MSC is considered statistically significant if the frequency
distribution of disease and nondisease individuals has p-value < 0.05
Many reported findings are not reproducible in
different populations. It is believed that this happens because
the p-values are not adjusted for multiple testing
Disease-closure allows statistically significant MSCs
to be found at an earlier stage of searching.
Trivial MSCs and MSCs which coincide after disease-closure are avoided. That significantly speeds up the
searching.
Faster than ES
Finds more significant associations at the early stages of searching
Still slow for genome-wide studies
Searching level – number of SNPs which define MSC before disease-closure
Indexed Exhaustive Search (IES):
Exhaustive search on the indexed datasets obtained by extracting
k indexed SNPs with MLR based tagging method.
MLR - multiple linear regression based tagging method (He and
Zelikovsky, 2006).
Indexed Combinatorial Search (ICS):
Combinatorial search on the indexed datasets obtained
by extracting k indexed SNPs with MLR based tagging
method.
Can perform complete searching for the larger datasets
Data Sets
Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb
region of human Chromosome 5q31, 144 disease genotypes and 243 nondisease
genotypes (Daly et al., 2001).
Autoimmune disorder: 1024 genotypes with 108 SNPs containing the genes CD28,
CTLA4 and ICOS, 378 disease genotypes and 646 nondisease genotypes (Ueda
et al., 2003).
The tradeoff between the number of chosen indexing SNPs and
quality of reconstruction requires choosing the maximum number
of index SNPs that can be handled by ES in a reasonable
computational time.
Can perform complete searching for the larger datasets
For genome-wide studies the number of tags cannot be reduced to 5-10. Therefore, IES will
not be able to perform a complete search
Combinatorial Search (CS):
Similar to ES, it checks all one-SNP, two-SNP, ..., m-SNP disease-closed combinations.
The disease-closure of a multi-SNP combination C is a multi-SNP
combination C’ with the maximum number of SNPs that covers
the same set of disease individuals and the minimum number of
nondisease individuals.
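One way to read this definition in code: collect the disease individuals covered by C, then add every SNP on which all of them share the same value. Each added SNP keeps the disease set unchanged and can only remove nondisease individuals. The sketch below is an interpretation under assumptions (MSC values restricted to 1 or 2, MSCs as tuples of (SNP index, value) pairs), not the authors' implementation, and `disease_closure` is a hypothetical name.

```python
def disease_closure(genotypes, status, msc, values=(1, 2)):
    """Return the largest MSC covering the same disease individuals as `msc`.

    Assumes every value in `msc` is one of `values`; SNPs whose shared value
    falls outside `values` (e.g. 0) are not added to the closure.
    """
    cases = [g for g, s in zip(genotypes, status)
             if s and all(g[j] == v for j, v in msc)]
    if not cases:
        return tuple(msc)  # no disease individuals covered: leave unchanged
    closed = []
    for j in range(len(genotypes[0])):
        shared = cases[0][j]
        if shared in values and all(g[j] == shared for g in cases):
            closed.append((j, shared))
    return tuple(closed)

# Toy data: 4 individuals, 3 SNPs; True marks a disease individual.
toy_genotypes = [(1, 2, 0), (1, 2, 1), (1, 0, 2), (0, 2, 0)]
toy_status = [True, True, False, False]

closure_a = disease_closure(toy_genotypes, toy_status, ((0, 1),))
closure_b = disease_closure(toy_genotypes, toy_status, ((1, 2),))
```

Here both one-SNP combinations close to `((0, 1), (1, 2))`, so a search over disease-closed combinations would evaluate that combination once instead of visiting each starting MSC separately.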
Tick-borne encephalitis: 75 genotypes with 41 SNPs containing the genes TLR3,
PKR, OAS1, OAS2, and OAS3, 21 disease genotypes and 54 nondisease
genotypes (Barkash et al., 2006).
Discussion
Comparing indexed counterparts with ES and CS shows that indexing is quite successful.
Indeed, the indexed searches found the same multi-SNP combinations as the non-indexed
searches but were much faster and the multiple-testing adjusted 0.05-threshold was higher
and easier to meet.
The comparison of CS with its ES counterparts favors the former. Indeed, for the
Crohn's disease data (Daly et al., 2001), the ES on the first and second search levels is
unsuccessful while the CS finds several statistically significant multi-SNP combinations.
Similarly, for the tick-borne encephalitis virus-induced disease data, the CS and ICS(20)
found a significant association on the first level while no association was found by the ES or
IES(20). For the autoimmune disorder data (Ueda et al., 2003), the CS found many more
statistically significant multi-SNP combinations than the ES.
We conclude that the proposed indexing approach and the combinatorial search method are
very promising techniques for searching for statistically significant disease-associated
multi-SNP combinations and disease susceptibility prediction.
Disease-Associated Multi-SNP Combinations Search
Given: a population of n genotypes (or haplotypes) each
containing values of m SNPs from {0,1,2} and disease status
(diseased or nondisease)
Find: all multi-SNP combinations with multiple-testing-adjusted
p-value of the frequency distribution below 0.05
Results/comparison of searching methods
The relative qualities of the searching methods are compared
using the number of statistically significant multi-SNP
combinations found.
The statistical significance was adjusted for multiple testing and
the adjusted 0.05 threshold is shown (third column).
In the 4th, 5th and 6th columns, we give the frequencies of the
best multi-SNP combination among disease and nondisease
populations and the unadjusted p-value, respectively.