ibm_rochester_talk_may_2005

Transcript ibm_rochester_talk_may_2005

Data Mining for Finding Connections of Disease
and Medical and Genomic Characteristics
Vipin Kumar
William Norris Professor and
Head, Department of Computer Science
[email protected]
www.cs.umn.edu/~kumar
Michael Steinbach
Department of Computer Science
Increasing Amounts of Medical and
Genomic Data

Electronic medical records are becoming
increasingly common
– Automated analysis of patient information is now
possible

Obtaining genomic information is increasingly
affordable
– SNPs offer the potential of tests for disease or
susceptibility for disease
© Vipin Kumar
May 18, 2006
‹#›
Various Techniques Currently Used to Analyze
the Relationship Between SNPs and Disease

Statistical Association Analysis
– Chi-Square and Odds ratio

Logistic Regression

Multifactor Dimensionality Reduction
– Attribute selection and construction followed by
classification
© Vipin Kumar
May 18, 2006
‹#›
Challenges Facing Current Techniques

Statistical Association Analysis
– Often lacks power, even for single SNP association
– Combinatorial explosion when used for screening pairs, triples, etc.

Logistic Regression
– Does not work well when many attributes

Multifactor Dimensionality Reduction
– Impractical for many attributes due to combinatorial nature

General Challenges
– Disease or susceptibility to disease often results from interactions
among many genetic, phenotypic, and environmental factors
– Noise and nonlinear interactions
– The Challenges of Whole-Genome Approaches to Common
Diseases, Moore and Ritchie, JAMA, 2004
© Vipin Kumar
May 18, 2006
‹#›
General Approach Using
Data Mining Techniques

Create a data set that records the presence and absence of
– Phenotypic characteristics
– Genetic characteristics (SNPs)
– Disease

Apply association analysis to find groups of phenotypic and
genetic characteristics that are highly associated with
disease
– Uses characteristics of the patterns to prune the search space

Clustering and classification can also be applied
© Vipin Kumar
May 18, 2006
‹#›
Traditional Association Analysis

Association analysis: Analyzes
relationships among items (attributes) in
a binary transaction data
– Example data: market basket data
– Data can be represented as a binary
matrix
Set-Based Representation of Data
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
– Applications in business and science

Example: Milk  Diaper
© Vipin Kumar
May 18, 2006
Beer
Eggs
Coke
– Association Rules: X  Y, where X
and Y are itemsets.
1
2
3
4
5
Diapers
– Itemsets: Collection of items
 Example: {Milk, Diaper}
Milk
Two types of patterns
Bread

1
1
0
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
1
0
0
1
0
0
0
0
0
1
0
1
Binary Matrix Representation of Data
‹#›
The Need for Error-Tolerant Itemsets

An error-tolerant itemset (ETI) can have a fraction  of the
items missing in each transaction.
Example: see the data in the table
– Let  = 1/4. In other words, each
transaction needs to have
3/4 (75%) of the items.
– X = {i1, i2, i3, i4} and
Y = {i5, i6, i7, i8} are both
ETIs with a support of 4.

Algorithms to find ETIs are still in development

You can think of these ETIs as blocks in the data matrix
© Vipin Kumar
May 18, 2006
‹#›
ETIs in For Finding Patterns in
Phenotypic and Genomic Data

ETIs consist of
– A set of patients and
– A set of attributes
such that
– The block is relatively dense


These blocks identify sets of
patients that are highly associated with
certain sets of attributes and vice-versa
If most of these patients share a disease,
then these attributes (genetic and/or
phenotypic) are candidate markers for the
disease
© Vipin Kumar
May 18, 2006
X: Set of patients
Y: Set of attributes,
i.e., SNPs, medical
characteristics
‹#›
Example: Using ETIs to Find Rules


Can use ETIs to find better association rules
Example: Mushroom data set
– Classifies 8192 mushrooms as poisonous (3916) or not (4208)
– 117 other attributes, such as color, odor, etc.

Comparison of rules based on frequent itemsets or ETIs
– One rule is {29, 48, 90} → p
 Individual
correlations are 0.62, 0.54, 0.18
 With traditional association analysis, this rule has a
confidence of 1 and a support of 576.
– This corresponds to a correlation of 0.18
 If we require only two of the three items, we still have a
confidence of 1, but the support is 3312.
– This corresponds to a correlation of 0.86
– Maximum single item correlation is 0.78
© Vipin Kumar
May 18, 2006
‹#›
Generalizing ETIs to Blocks
in a Data Matrix

Dense blocks consist of
– A set of objects and
– A set of attributes
such that
– The block is relatively dense


These blocks identify sets of
objects that are highly associated with
certain sets of attributes and vice-versa
For gene expression data, this identifies
transcription modules
– Group of genes that are coexpressed together
under a set of conditions
© Vipin Kumar
May 18, 2006
X: Genes
Y: Conditions
under which genes
are expressed
Entries of matrix
are the level of a
gene’s expression
under a condition
‹#›
Techniques for Finding Relatively Dense
Blocks in a Data Matrix

Algorithms for finding Error-Tolerant Itemsets
– More work needed to develop algorithms
– We are currently using support envelopes
– Support Envelopes: A Technique for Exploring the Structure of
Association Patterns, Steinbach, Tan, Kumar, KDD, 2004

Subspace clustering
– Similar in spirit to association analysis
– Subspace clustering for high dimensional data: A review, Parsons ,
Haque , Liu SIGKDD Explorations, 2004

Co-clustering
– Information-Theoretic Co-Clustering, Dhillon, Mallela, Modha, KDD 2003
– We are currently exploring approaches to extend co-clustering for ETIs

Variety of other approaches
– Matrix factorization
– Graph-partitioning
© Vipin Kumar
May 18, 2006
‹#›
Support Envelopes
• A support envelope contains all association patterns involving m or
more transactions and n or more items
• By association patterns we mean
• Itemsets and variants (frequent, maximal, closed)
• Error Tolerant Itemsets (ETIs)
An example of a support envelope
involving characteristics of
mushrooms. One of the attributes is,
‘gill-color:buff’, which occurs in 1728
records, every one of which occurs
with 13 other items (one of which is
the attribute,‘poisonous’)
© Vipin Kumar
May 18, 2006
‹#›
Visualizing Support Envelopes for Mushroom

One of the support envelopes (576, 23) is denser than its
surrounding neighbors.
© Vipin Kumar
May 18, 2006
‹#›
Using Weak Associations to Find Patterns

Apply the following principle:
If most pairs of attributes or objects in a set have a pairwise
connections, then there is likely to be a strong association
among them even if the pairwise associations are weak.

The hyperclique pattern uses this principle
– Hyperclique Pattern Discovery, Xiong Tan, and Vipin Kumar, to
appear DMKD, 2006.
– Good for removing noise


Enhancing Data Analysis with Noise Removal, Hui Xiong, Gaurav
Pandey, Michael Steinbach, Vipin Kumar, TKDE, 2006
We have recently developed more general pairwise
patterns
– Custom Itemset Patterns, Steinbach and Kumar, submitted to KDD
2006
© Vipin Kumar
May 18, 2006
‹#›
Application of Classification Techniques

Techniques must work with noisy, sparse, highdimensional data

Success of multifactor dimensionality reduction
indicates the usefulness of attribute selection and
attribute creation

ETI and related patterns offer an alternative for feature
extraction

Classification based on association analysis

SVM with the proper kernel
© Vipin Kumar
May 18, 2006
‹#›

ibm_rochester_talk_may_2005

Transcript ibm_rochester_talk_may_2005

Directory