ibm_rochester_talk_may_2005
Download
Report
Transcript ibm_rochester_talk_may_2005
Data Mining for Finding Connections of Disease
and Medical and Genomic Characteristics
Vipin Kumar
William Norris Professor and
Head, Department of Computer Science
[email protected]
www.cs.umn.edu/~kumar
Michael Steinbach
Department of Computer Science
Increasing Amounts of Medical and
Genomic Data
Electronic medical records are becoming
increasingly common
– Automated analysis of patient information is now
possible
Obtaining genomic information is increasingly
affordable
– SNPs offer the potential of tests for disease or
susceptibility for disease
© Vipin Kumar
May 18, 2006
‹#›
Various Techniques Currently Used to Analyze
the Relationship Between SNPs and Disease
Statistical Association Analysis
– Chi-Square and Odds ratio
Logistic Regression
Multifactor Dimensionality Reduction
– Attribute selection and construction followed by
classification
© Vipin Kumar
May 18, 2006
‹#›
Challenges Facing Current Techniques
Statistical Association Analysis
– Often lacks power, even for single SNP association
– Combinatorial explosion when used for screening pairs, triples, etc.
Logistic Regression
– Does not work well when many attributes
Multifactor Dimensionality Reduction
– Impractical for many attributes due to combinatorial nature
General Challenges
– Disease or susceptibility to disease often results from interactions
among many genetic, phenotypic, and environmental factors
– Noise and nonlinear interactions
– The Challenges of Whole-Genome Approaches to Common
Diseases, Moore and Ritchie, JAMA, 2004
© Vipin Kumar
May 18, 2006
‹#›
General Approach Using
Data Mining Techniques
Create a data set that records the presence and absence of
– Phenotypic characteristics
– Genetic characteristics (SNPs)
– Disease
Apply association analysis to find groups of phenotypic and
genetic characteristics that are highly associated with
disease
– Uses characteristics of the patterns to prune the search space
Clustering and classification can also be applied
© Vipin Kumar
May 18, 2006
‹#›
Traditional Association Analysis
Association analysis: Analyzes
relationships among items (attributes) in
a binary transaction data
– Example data: market basket data
– Data can be represented as a binary
matrix
Set-Based Representation of Data
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
– Applications in business and science
Example: Milk Diaper
© Vipin Kumar
May 18, 2006
Beer
Eggs
Coke
– Association Rules: X Y, where X
and Y are itemsets.
1
2
3
4
5
Diapers
– Itemsets: Collection of items
Example: {Milk, Diaper}
Milk
Two types of patterns
Bread
1
1
0
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
1
0
0
1
0
0
0
0
0
1
0
1
Binary Matrix Representation of Data
‹#›
The Need for Error-Tolerant Itemsets
An error-tolerant itemset (ETI) can have a fraction of the
items missing in each transaction.
Example: see the data in the table
– Let = 1/4. In other words, each
transaction needs to have
3/4 (75%) of the items.
– X = {i1, i2, i3, i4} and
Y = {i5, i6, i7, i8} are both
ETIs with a support of 4.
Algorithms to find ETIs are still in development
You can think of these ETIs as blocks in the data matrix
© Vipin Kumar
May 18, 2006
‹#›
ETIs in For Finding Patterns in
Phenotypic and Genomic Data
ETIs consist of
– A set of patients and
– A set of attributes
such that
– The block is relatively dense
These blocks identify sets of
patients that are highly associated with
certain sets of attributes and vice-versa
If most of these patients share a disease,
then these attributes (genetic and/or
phenotypic) are candidate markers for the
disease
© Vipin Kumar
May 18, 2006
X: Set of patients
Y: Set of attributes,
i.e., SNPs, medical
characteristics
‹#›
Example: Using ETIs to Find Rules
Can use ETIs to find better association rules
Example: Mushroom data set
– Classifies 8192 mushrooms as poisonous (3916) or not (4208)
– 117 other attributes, such as color, odor, etc.
Comparison of rules based on frequent itemsets or ETIs
– One rule is {29, 48, 90} → p
Individual
correlations are 0.62, 0.54, 0.18
With traditional association analysis, this rule has a
confidence of 1 and a support of 576.
– This corresponds to a correlation of 0.18
If we require only two of the three items, we still have a
confidence of 1, but the support is 3312.
– This corresponds to a correlation of 0.86
– Maximum single item correlation is 0.78
© Vipin Kumar
May 18, 2006
‹#›
Generalizing ETIs to Blocks
in a Data Matrix
Dense blocks consist of
– A set of objects and
– A set of attributes
such that
– The block is relatively dense
These blocks identify sets of
objects that are highly associated with
certain sets of attributes and vice-versa
For gene expression data, this identifies
transcription modules
– Group of genes that are coexpressed together
under a set of conditions
© Vipin Kumar
May 18, 2006
X: Genes
Y: Conditions
under which genes
are expressed
Entries of matrix
are the level of a
gene’s expression
under a condition
‹#›
Techniques for Finding Relatively Dense
Blocks in a Data Matrix
Algorithms for finding Error-Tolerant Itemsets
– More work needed to develop algorithms
– We are currently using support envelopes
– Support Envelopes: A Technique for Exploring the Structure of
Association Patterns, Steinbach, Tan, Kumar, KDD, 2004
Subspace clustering
– Similar in spirit to association analysis
– Subspace clustering for high dimensional data: A review, Parsons ,
Haque , Liu SIGKDD Explorations, 2004
Co-clustering
– Information-Theoretic Co-Clustering, Dhillon, Mallela, Modha, KDD 2003
– We are currently exploring approaches to extend co-clustering for ETIs
Variety of other approaches
– Matrix factorization
– Graph-partitioning
© Vipin Kumar
May 18, 2006
‹#›
Support Envelopes
• A support envelope contains all association patterns involving m or
more transactions and n or more items
• By association patterns we mean
• Itemsets and variants (frequent, maximal, closed)
• Error Tolerant Itemsets (ETIs)
An example of a support envelope
involving characteristics of
mushrooms. One of the attributes is,
‘gill-color:buff’, which occurs in 1728
records, every one of which occurs
with 13 other items (one of which is
the attribute,‘poisonous’)
© Vipin Kumar
May 18, 2006
‹#›
Visualizing Support Envelopes for Mushroom
One of the support envelopes (576, 23) is denser than its
surrounding neighbors.
© Vipin Kumar
May 18, 2006
‹#›
Using Weak Associations to Find Patterns
Apply the following principle:
If most pairs of attributes or objects in a set have a pairwise
connections, then there is likely to be a strong association
among them even if the pairwise associations are weak.
The hyperclique pattern uses this principle
– Hyperclique Pattern Discovery, Xiong Tan, and Vipin Kumar, to
appear DMKD, 2006.
– Good for removing noise
Enhancing Data Analysis with Noise Removal, Hui Xiong, Gaurav
Pandey, Michael Steinbach, Vipin Kumar, TKDE, 2006
We have recently developed more general pairwise
patterns
– Custom Itemset Patterns, Steinbach and Kumar, submitted to KDD
2006
© Vipin Kumar
May 18, 2006
‹#›
Application of Classification Techniques
Techniques must work with noisy, sparse, highdimensional data
Success of multifactor dimensionality reduction
indicates the usefulness of attribute selection and
attribute creation
ETI and related patterns offer an alternative for feature
extraction
Classification based on association analysis
SVM with the proper kernel
© Vipin Kumar
May 18, 2006
‹#›