DTC-symposium-2008-poster

Data Mining Techniques For Correlating Phenotypic
Expressions With Genomic and Medical Characteristics
Rohit Gupta, Blayne Field, Michael Steinbach, Vipin Kumar, Rich Mushlin*, Fred Kulack+
Department of Computer Science and Engineering, University of Minnesota (200 Union Street SE, Minneapolis MN 55455 USA)
*IBM T. J. Watson Research Center, +IBM Rochester
e-coords: [email protected], [email protected]
INTRODUCTION
Project Motivation
• Obtaining genomic information is increasingly affordable
o Single Nucleotide Polymorphisms (SNPs) offer the potential to test for disease or susceptibility to disease
• Electronic medical records (EMRs) are becoming increasingly common
o Automated analysis of patient information is now possible
• This revolution in genetic and medical information potentially leads to personalized medicine, i.e., using detailed genomic and medical information about a person for the detection, treatment, or prevention of disease
Data Set
• Genetic data (SNPs)
o Simulated SNP data generated from known models has been used for this study: approximately 2000 case records and 6000 control records
METHODS
Association Analysis
• Data mining-based association analysis is applied to find patterns that capture the connections between SNPs and disease
o Frequent closed itemsets capture SNP patterns where all SNPs must be present
o Error-tolerant itemsets (ETIs) capture more general SNP patterns, where not all SNPs need to occur in all patients defining the pattern
o Existing techniques include statistical association analysis, logistic regression, Multifactor Dimensionality Reduction, CART, Random Forests, etc.
• Based on the disease variable, patients are categorized as cases or controls
• First, we find patterns (closed itemsets or ETIs) in cases and then check for their presence in control patients. Odds ratio (OR) and p-value metrics (as described below) are used to evaluate the identified patterns
• ETI example: with ε = 1/4, each transaction needs to have 3/4 (75%) of the items; {i1, i2, i3, i4} and {i5, i6, i7, i8} are both ETIs with a support of 4
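The error-tolerant itemset (ETI) idea described above can be illustrated with a minimal Python sketch. It assumes the per-transaction reading of the tolerance: a transaction supports an itemset if it holds at least a (1 - eps) fraction of the items. The function name `eti_support` and the toy transactions are hypothetical, chosen so that {i1, i2, i3, i4} and {i5, i6, i7, i8} each reach a support of 4 at a tolerance of 1/4. It only counts support for a given itemset and does not search for candidate itemsets, so it is not the algorithm used in the study.

```python
from fractions import Fraction


def eti_support(transactions, itemset, eps):
    """Count transactions holding at least a (1 - eps) fraction of `itemset`."""
    # Exact rational arithmetic avoids floating-point surprises at the threshold.
    needed = (1 - Fraction(eps)) * len(itemset)
    return sum(1 for t in transactions if len(itemset & t) >= needed)


# Hypothetical toy data: every transaction misses exactly one item of its block.
transactions = [
    {"i1", "i2", "i3"},
    {"i1", "i2", "i4"},
    {"i1", "i3", "i4"},
    {"i2", "i3", "i4"},
    {"i5", "i6", "i7"},
    {"i5", "i6", "i8"},
    {"i5", "i7", "i8"},
    {"i6", "i7", "i8"},
]
left = frozenset({"i1", "i2", "i3", "i4"})
right = frozenset({"i5", "i6", "i7", "i8"})

print(eti_support(transactions, left, Fraction(1, 4)))   # tolerant support
print(eti_support(transactions, right, Fraction(1, 4)))  # tolerant support
print(eti_support(transactions, left, 0))                # exact support (eps = 0)
```

With eps set to 0 the same function reduces to the ordinary (exact) support used for frequent itemsets, which is why neither toy itemset would be found without the tolerance.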
Figure: Patients' genetic information (SNPs) as a binary matrix, with disease (Yes/No) as the class label.
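One plausible way to build such a binary matrix is sketched below in Python. It assumes each SNP takes one of three genotypes (aa, Aa, AA) and gives every genotype at every SNP its own 0/1 column, named in the style of the items in the results (for example aa1 or AA5). The encoding, the function name `encode_patients`, and the three example patients are assumptions for illustration, not the study's actual preprocessing.

```python
GENOTYPES = ("aa", "Aa", "AA")


def encode_patients(genotypes_per_patient):
    """One-hot encode each patient's genotype at each SNP into 0/1 columns."""
    n_snps = len(genotypes_per_patient[0])
    # Column order: aa1, Aa1, AA1, aa2, Aa2, AA2, ...
    columns = [f"{g}{j}" for j in range(1, n_snps + 1) for g in GENOTYPES]
    index = {name: k for k, name in enumerate(columns)}
    matrix = []
    for row in genotypes_per_patient:
        bits = [0] * len(columns)
        for j, genotype in enumerate(row, start=1):
            bits[index[f"{genotype}{j}"]] = 1
        matrix.append(bits)
    return columns, matrix


# Hypothetical patients with two SNPs each; disease is the class label (1 = Yes).
patients = [("aa", "Aa"), ("AA", "aa"), ("aa", "aa")]
disease = [1, 0, 1]

columns, matrix = encode_patients(patients)
print(columns)
for bits, label in zip(matrix, disease):
    print(bits, label)
```

Each row then doubles as a transaction whose items are the columns set to 1, which is the form itemset mining expects.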
Problem Formulation
• Given: A patient data set that records
o Phenotypic expression (disease)
o Genetic characteristics
o Medical characteristics
• Objective: Find patterns combining medical and genetic characteristics that best define the phenotypic expression under study
• Challenges:
o High dimensionality and low sample size
o Combinatorial explosion
o Noise
o Non-linear interactions
Workflow: Find strong patterns in cases → Evaluate strength of patterns in controls → Rank all the patterns using OR and p-value to obtain final results
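The three workflow steps can be sketched in Python as follows. This is an illustrative outline under several assumptions, not the study's implementation: patterns are found by brute-force enumeration of ordinary frequent itemsets (not closed itemsets or ETIs), ranking uses the odds ratio only (the p-value is left out for brevity), and a zero denominator is reported as infinity. All function names and the toy cases and controls are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force search for itemsets contained in >= min_support transactions."""
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            support = sum(1 for t in transactions if pattern <= t)
            if support >= min_support:
                found[pattern] = support
    return found


def odds_ratio(a, b, c, d):
    """OR = (a*d)/(b*c); reported as infinity when b*c is zero (a simplification)."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


def rank_patterns(cases, controls, min_support):
    ranked = []
    # Step 1: find patterns in cases; a = cases with the pattern.
    for pattern, a in frequent_itemsets(cases, min_support).items():
        # Step 2: evaluate the same pattern in controls; b = controls with it.
        b = sum(1 for t in controls if pattern <= t)
        c = len(cases) - a       # cases without the pattern
        d = len(controls) - b    # controls without the pattern
        ranked.append((odds_ratio(a, b, c, d), sorted(pattern)))
    # Step 3: rank, strongest association first.
    return sorted(ranked, reverse=True)


# Hypothetical toy data.
cases = [{"aa1", "aa2"}, {"aa1", "aa2"}, {"aa1", "AA3"}, {"aa2"}]
controls = [{"aa1"}, {"AA3"}, {"AA3", "aa2"}, {"AA3"}]

ranked = rank_patterns(cases, controls, min_support=2)
for score, pattern in ranked:
    print(pattern, score)
```

Brute-force enumeration grows combinatorially with the number of items, so it is only workable on toy inputs like this one.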
Figures of Merit for 2 x 2 table

                  With Pattern   Without Pattern   Row Margins
Cases             a              c                 Ncases
Controls          b              d                 Ncontrols
Column Margins    Nwith          Nwithout          Ntotal

• We use odds ratio (OR) and p-value (P)
o OR quantifies how different cases and controls are for a specific pattern
o P quantifies the significance of the difference reflected by OR
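Both figures of merit can be computed directly from the four counts, as in the Python sketch below. It assumes a, b, c, d are cases with the pattern, controls with the pattern, cases without it, and controls without it, and that P is the one-sided hypergeometric tail: the probability, with all margins fixed, of a table whose upper-left count is at least the observed a. The function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """OR = (a*d)/(b*c); reported as infinity when b*c is zero (a simplification)."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


def hypergeom_p_value(a, b, c, d):
    """Probability of a table with the same margins and at least `a` cases with the pattern."""
    n_cases = a + c
    n_with = a + b
    n_without = c + d
    n_total = n_with + n_without
    # With fixed margins, a cannot exceed either the number of cases
    # or the number of patients with the pattern.
    upper = min(n_cases, n_with)
    # Each term counts the ways to place k cases among the patients with the
    # pattern and the remaining cases among the patients without it.
    numerator = sum(
        comb(n_with, k) * comb(n_without, n_cases - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n_total, n_cases)


# Hypothetical table: 3 of 4 cases and 1 of 4 controls carry the pattern.
a, b, c, d = 3, 1, 1, 3
print(odds_ratio(a, b, c, d))
print(hypergeom_p_value(a, b, c, d))
```

Each term of the sum equals the factorial form Ncases! Ncontrols! Nwith! Nwithout! / (a! b! c! d! Ntotal!) for the corresponding table. Integer binomials keep the computation exact until the final division.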
Evaluation Measures
• There are many different figures of merit (FOM), i.e., functions of a, b, c, and d, that can be used to characterize the table
• a, b, c, and d are the number of cases with the pattern, controls with the pattern, cases without the pattern, and controls without the pattern, respectively
• Odds ratio: $OR = \frac{a \cdot d}{b \cdot c}$
• P is the probability of a table (shown above) with the same fixed margins having a higher (or the same) OR:
$$P = \sum_{a \ge a_{obs}}^{a \le N_{cases}} \frac{N_{cases}!\, N_{controls}!\, N_{with}!\, N_{without}!}{a!\, b!\, c!\, d!\, N_{total}!}$$
where b, c, and d in each term are determined by a and the fixed margins
Hypergeometric Distribution
[Figure: Log(p) plotted against Log(OR). Legend entries: 500,500,500,500; 500,500,250,750; 250,750,250,750; 250,750,100,900]
Probability distribution, p, as a function of odds ratio, OR, for Ntotal = 1000 and several sets of margins (full range of points is shown). The margins in the legend are in the order Ncases, Ncontrols, Nwith, Nwithout.
Data Set (continued)
o Real SNP data for Parkinson's disease and myeloma
RESULTS AND DISCUSSIONS

Itemset                      -log10(p-value)   Odds Ratio
aa1 aa2 aa3 aa4              5.452             5.442
Aa1 aa2 aa4 Aa8              3.935             1.661
aa1 Aa2 aa3 AA5 AA6          3.770             3.002
Aa1 aa2 AA5 AA6 AA7 Aa8      3.739             3.845
aa1 aa2 AA7 Aa8              3.661             1.934
aa1 aa2 aa3 AA5              3.541             2.844
aa1 aa3 AA5 AA6              3.503             1.965
aa2 aa3 AA5 Aa7 Aa8          3.448             2.177
aa2 aa3 AA5 Aa7              3.421             1.682
aa1 aa3                      3.414             2.486

Conclusions
• Various association analysis algorithms have been applied to find connections between genetic characteristics (SNPs) and disease
• Techniques for finding closed itemsets have proven effective for finding SNP patterns in synthetic data
• Algorithms for finding ETIs have shown promise, but the evaluation is not complete
• Odds ratio and p-value are found to be the best indicators of real patterns for synthetic SNP data. They are also found to be highly correlated with other similarity measures
• Computational demands of the algorithms are high
References
http://www-users.cs.umn.edu/~kumar/dmbio/index.html
• R. Mushlin, A. Kirshenbaum, S. Gallagher, T. Rebbeck, A graph-theoretical approach for pattern
discovery in epidemiological research, IBM Systems Journal 46, No. 1, 135-149 (2007)
• Jason H. Moore; Marylyn D. Ritchie, The Challenges of Whole-Genome Approaches to Common
Diseases, JAMA 2004 291: 1642-1643
• L. Bastone, M. Reilly, D. L. Rader, and A. S. Foulkes, MDR and PRP: A Comparison of Methods
for High-Order Genotype-Phenotype Associations, Human Heredity 58, No. 2, 2-92 (2004)
• A. S. Foulkes, M. Reilly, L. Zhou, M. Wolfe, and D. J. Rader, Mixed Modeling to Characterize
Genotype Phenotype Associations, Statistics in Medicine 24, No. 5, 775-789 (2005)
• A. Hattersley and M. McCarthy, What makes a good genetic association study? The Lancet,
Volume 366, Issue 9493, Pages 1315-1323, Oct. 2005
• Seppänen, J. K. and Mannila, H. 2004. Dense itemsets. In Proceedings of the Tenth ACM
SIGKDD international Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA,
August 22 - 25, 2004). KDD '04. ACM Press, New York
• Tan, P.-N., Steinbach, M. and Kumar, V., Introduction to Data Mining, Pearson Addison-Wesley,
May 2005
Acknowledgements
This work has been supported by DTC, IBM, and an NSF grant. Computational resources for this work were provided by the Minnesota Supercomputing Institute.
