Combinatorial Methods for Disease Association Search and

Download Report

Transcript Combinatorial Methods for Disease Association Search and

Combinatorial Methods
for Disease Association Search
and Susceptibility Prediction
Alexander Zelikovsky
joint work with Dumitru Brinza
Department of Computer Science
Outline


SNPs, Haplotypes and Genotypes
Disease Association Search





Predicting susceptibility to complex diseases




Genome-wide association search challenges
Problem formulation
Exhaustive & Combinatorial Search
Optimization formulation & complimentary greedy search
Problem formulation/cross-validation
Previous methods: SVM, RF, LP
Optimum clustering and prediction via model-fitting
Conclusions
2
SNP, Haplotypes, Genotypes









Length of Human Genome
 3  109
#Single nucleotide polymorphism (SNPs)  1  107
SNPs are mostly biallelic, e.g., AC
Minor allele frequency should be considerable e.g. >.1%
Difference b/w ALL people  0.25% (b/w any 2  0.1%)
Diploid = two different copies of each chromosome
Haplotype = description of a single copy (expensive)
 example: 00110101 (0 is for major, 1 is for minor allele)
Genotype = description of the mixed two copies
 example 01122110 (0=00, 1=11, 2=01)
International Hapmap project: www.hapmap.org
3
Challenges of Disease Association


Monogenic disease
 A mutated gene is entirely responsible for the disease .
 Typically rare in population: < 0.1%.
Complex disease
 Interaction of multiple genes


Multiple independent causes


Each cause explains < 10-20% of cases
Common: > 0.1%.


2-SNP interaction analysis for a genome-wide scan with 1
million SNPs (3 kb coverage) has 1012 pairwise tests
In NY city, 12% of the population has Type 2 Diabetes
Multiple testing adjustment
 Reason for non-reproducible findings
4
Disease Association Search Problem


Disease
SNPs
Status
0101201020102210 -1
Non-diseased 0220110210120021 -1
genotypes: H 0200120012221110 -1
0020011002212101 -1
Sample population S
1101202020100110 1
of individual genotypes
Diseased
0120120010100011 1
genotypes: D 0210220002021112 1
0021011000212120 1
Risk/resistance factor = multi-SNP combination (MSC)



Cluster C= subset of S with an MSC,


a subset of SNP-columns of S
the values of these SNPs, 0, 1, or 2
d(C) = diseased, h(C) = non-diseased
PROBLEM: Find all MSCs significantly associated with the disease
5
Significance of Risk/Resistance Factors

Measured by
 Relative risk (RR)
 Odds ratio (OR)
 Their p-values

Unadjusted p-value: Probability of case/control distribution among
exposed to risk factor, computed by binomial distribution

Multiple-testing adjustement:
 Bonferroni
 easy to compute
 overly conservative
 Randomization
 computationally expensive
 more accurate
6
Exhaustive & Combinatorial Search

Exhaustive search is infeasible


sample with n genotypes/m SNPs requires O(n3m)
Combinatorial search
Definition: Disease-closure of a multi-SNP combination C is a multi-SNP
combination C’, with maximum number of SNPs, which consists of the same
set of disease individuals and minimum number of nondisease individuals.

Searches only closed clusters



Avoids checking of trivial MSCs



Closure of cluster C = C’
d(C’)=d(C) and h(C’) is minimized
Small d(C) implies not looking in subclusters
Finds faster associated MSCs but still too slow
Tagging:



compress S by extracting most informative SNPs
restore other SNPs from tag SNPs
multiple regression method
7
MLR Tagging
Stepwise Index SNP Algorithm:



Choose as a tag the SNP which best predicts all other SNPs
Choose the next one which together with a first best predicts all other SNPs
and so on.
Prediction method is based on Multiple Linear Regression (MLR)
So far beats in quality other methods (STAMPA)
8
Data Sets

Crohn's disease (Daly et al ): inflammatory bowel disease (IBD).
Location: 5q31
Number of SNPs: 103
Population Size: 387
case: 144 control: 243

Autoimmune disorders (Ueda et al) :
Location: containing gene CD28, CTLA4 and ICONS
Number of SNPs: 108
Population Size: 1024
case: 378 control: 646

Tick-borne encephalitis dataset of (Barkash et al) :
Location: containing gene TLR3, PKR, OAS1, OAS2, and OAS3.
Number of SNPs: 41
Population Size: 75
case: 21 control: 54
9
Disease association
search results

IES(30):



exhaustive search
30 indexed SNPs with
MLR based tagging
method
ICS(30):


combinatorial search
30 indexed SNPs with
MLR based tagging
method.
10
Disease Association Search

Optimum Association Search Problem:



Bad news: Generalization of max
independent set


Find MSC that is the most associated with
the disease
Measure: positive predictive value
= find (non-)diseased-free cluster of
maximum size
NP complete and cannot be well
approximated
Hope: sample S is not arbitrary
11
Complimentary Greedy Algorithm

Algorithm


Start with C=S (resp. MCS is empty)
Repeat until h(C)=0 (non-diseased-free)





Find 1-SC s maximizing (h(C)-h(C  {s})) / (d(C) – d(C 
{s})) = minimize payment with diseased for removal of
non-diseased
Add s to SNPs of C’s MSC
Analogy: finding independent set by greedy
removing highest degree vertecies
Extremely fast but inaccurate
Can be used in susceptibility prediction
12
Most disease-associated & disease-resistant MSC
Comparison of three methods for searching the disease-associated and diseaseresistant multi-SNPs combinations with the largest PPV. The starred values refer to
results of the runtime-constrained exhaustive search
13
Genetic Susceptibility Prediction

Given: Genotypes of diseased and non-diseased individuals,
Genotype of a testing person.
Find: The disease status of the testing person
healthy
sick
testing - gt
Genotype
Disease Status
0101201020102210
0220110210120021
0200120012221110
0020011002212101
1101202020100110
0120120010100011
0210220002021112
0021011000212120
-1
-1
-1
-1
1
1
1
1
0110211101211201
s(gt)
14
Cross-validation

Leave-one-out test: The disease status of each genotype
in the data set is predicted while the rest of the data is
regarded as the training set.
Genotype
0101201020102210
0220110210120021
0200120012221110
0020011002212101
0020011002212101

Real
Disease Status
-1
-1
-1
1
1
Predicted
Disease Status
-1
-1
1
1
1
Accuracy = 80%
Leave-many-out test: Repeat randomly picking 2/3 of
the population as training set and predict the other 1/3.
15
Quality Measures of Prediction (confusion table)
Disease
Prediction
+
-
True Positive
False Positive
(TP)
(FP)
False Negative
True Negative
(FN)
(TN)
+
Test
-




Sensitivity: The ability to correctly detect disease.
sensitivity = TP/(TP+FN)
Specificity: The ability to avoid calling normal as disease.
specificity = TN/(FP+TN)
Accuracy = (TP +TN)/(TP+FP+FN+TN)
Risk Rate: Measurements for risk factors.
16
Prediction Methods



Support vector machine
Random forest
LP-based prediction
17
Prediction via Clustering


Drawback of the prediction problem formulation =
need of cross-validation  no optimization
Clustering P = partition into clusters defined by
MSC’s



Minimizing number of errors
S.t. bounded information entropy –∑(Si/S) log(Si/S)
Model-fitting prediction





Set status of testing genotype to diseased
Find number of errors
Set status of testing genotype to diseased
Find number of errors
Predict status that implies lesser number of errors
18
Leave-1-out cross-validation results
Leave-one-out cross-validation for combinatorial search-based prediction
(CSP) and complimentary greedy search-based prediction (CGSP) are
given when 20, 30, or all SNPs are chosen as informative SNPs.
19
ROC curve
Comparison of 5 prediction methods on (Barkash et. al,2006 ) data on all SNPs.
Area under the CSP’s curve is 0.81 vs 0.52 under the SVM’s curve.
20
Conclusions





Combinatorial search is able to find
statistically significant multi-gene interactions,
for data where no significant association was
detected before
Complimentary greedy search can be used in
susceptibility prediction
Optimization approach to prediction
New susceptibility prediction is by 8% higher
than the best previously known
MLR-tagging efficiently reduces the datasets
allowing to find associated multi-SNP
combinations and predict susceptibility
21
International Symposium
on Bioinformatics Research and Applications
May 6-9, 2007, Georgia State University, Atlanta, Georgia
http://www.cs.gsu.edu/ISBRA/
ISBRA provides a forum for the exchange of ideas and results
among researchers, developers, and practitioners working
on all aspects of bioinformatics and computational biology and their applications
Submissions must and must not exceed 12 pages in Springer LNCS style
The proceedings of ISBRA 2007 will be published in LNBI
Important Dates
Submission Deadline
Notification of Acceptance
Final Version Submission
December 20, 2006
January 31, 2007
February 21, 2007
Symposium Organizers
General Chairs: Dan Gusfield (University of California, Davis) and
Yi Pan (Georgia State University)
• Program Chairs: Ion Mandoiu (University of Connecticut) and
Alexander Zelikovsky (Georgia State University)
•
22
History of ISBRA
ISBRA is the successor of the International Workshop on Bioinformatics Research
and Applications (IWBRA), held on
- May 22-25, 2005 in Atlanta, GA and
- May 28-31, 2006 in Reading, UK
in conjunction with the International Conference on Computational Science
The two editions of IWBRA have enjoyed a great success, with special issues
devoted to full versions of selected papers in
Springer LNCS Transactions on Computational Systems Biology and
IEEE/ACM Transactions on Computational Biology and Bioinformatics
23
Support Vector Machine (SVM) Algorithm

Learning Task



Given: Genotypes of patients and healthy persons.
Compute: A model distinguishing if a person has the
disease.
Classification Task


Given: Genotype of a new patient + a learned model
Determine: If a patient has the disease or not.
Linear SVM
Non-Linear SVM
24
Random Forest Algorithm



Random Forests grows many classification trees. To
classify a new object from an input vector, put the
input vector down to each tree in the forest. Each
tree gives a classification, and we say the tree “votes”
for that class. The forest chooses the classification
having the most votes (over all the trees in the
forest).
Growing Tree, Split selection and Prediction.
Random sub-sample of training data, Random splitter
selection.
25
LP-based Prediction Algorithm

Model:




Certain haplotypes are susceptible to the disease while others are
resistant to the disease.
The genotype susceptibility is assumed to be a sum of
susceptibilities of its two haplotypes.
Assign a positive weight to susceptible haplotypes and a negative
weight to resistant haplotypes such that for any control genotype
the sum of weights of its haplotypes is negative and for any case
genotype it is positive.
For each vertex-haplotype hi assign the weight pi,
such that for any genotype-edge ei,j =(hi,,hj )
where s(ei,j )  {-1,1} is the disease status of genotype ei,j.
The sum of absolute values of genotype weights is
maximized.
26