BiCobGaurav4

Download Report

Transcript BiCobGaurav4

Association Analysis Techniques for
Bioinformatics Problems
Vipin Kumar
University of Minnesota
[email protected]
www.cs.umn.edu/~kumar
Group members:
Gaurav Pandey, Gowtham Atluri, Gang Fang, Rohit Gupta,
Michael Steinbach
www.cs.umn.edu/~kumar/dmbio
BICoB 2009
April 8th, 2009
Data Mining for Biomedical Informatics
 Recent technological advances are
helping to generate large amounts
of genomic data
- Gene and protein sequences
- Gene-expression data
- Biological networks and phylogenetic
profiles
- Single Nucleotides Polymorphisms
(SNPs)
 Data mining offers potential solution
for analysis of large-scale data
• Prediction of the functions of anonymous
genes
• Identification and summarizarion of
functional modules
• Associations between genotypes and
phenotype of interest
Protein Interaction Network
Association Analysis
Set-Based Representation of Data
• Association analysis: Analyzes relationships
among items (attributes) in a binary transaction
data
– Example data: market basket data
– Applications in business and science
• Marketing and Sales Promotion
• Identification of functional modules from protein
complexes
• Noise removal from protein interaction data
• Two types of patterns
– Itemsets: Collection of items
• Example: {Milk, Diaper}
– Association Rules: X  Y, where X and Y are
itemsets.
• Example: Milk  Diaper
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
Support, s 
# transacti ons that contain X and Y
Total transacti ons
Confidence , c 
# transacti ons that contain X and Y
# transacti ons that contain X
Association Analysis

Process of finding interesting patterns:
•
•
•
Find frequent itemsets using a support
threshold
Find association rules for frequent itemsets
Sort association rules according to
confidence
null
A

C
D
E
Support filtering is necessary
•
•
To eliminate spurious patterns
To avoid exponential search
-
AB
AC
AD
AE
BC
BD
BE
CD
CE
DE
ABC
ABD
ABE
ACD
ACE
ADE
BCD
BCE
BDE
CDE
Support has anti-monotone property:
X  Y implies (Y) ≤ (X)
ABCD

B
Confidence is used because
of its interpretation as
conditional probability


Has well-known limitations
More advanced measures: Hconfidence
ABCE
ABDE
ACDE
BCDE
ABCDE
Given d items, there are 2d
possible candidate itemsets
Different Association patterns
• Traditional frequent pattern
• Hyperclique pattern
• Error-tolerant frequent pattern
Case
• Discriminative pattern
Controls
Association Analysis for Identification of
Functional Modules from Protein Complex Data
Complexes
Proteins
c1
p1, p2
c2
p1, p3, p4, p5
c3
p2, p3, p4, p6
 A protein complex is a group of two or
more associated proteins formed by
protein-protein interaction that is stable
over time
 Gavin et al (2002)’s data set contains 232
complexes covering 1361 proteins
Can be represented as a market-basket
matrix
 Hypercliques derived from this data can
be used to discover frequently occurring
groups of proteins in several complexes
Likely to constitute functional modules
Xiong et al, “Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery”, Proc. Pac.
Symp. Biocomput. , pp 540-549, 2005.
Function Coherence of Hypercliques
GO enrichment of {PRE2 PRE4 PRE5 PRE6 PRE8 PRE9 PUP3 SCL1}.
Association Analysis-based
Pre-processing of Protein Interaction Networks
• Protein interaction data faces significant data quality challenges
– Significant noise and incomplete [Hart et al, 2006]
– Adverse impact on accuracy of functional inferences [Deng et al, 2003]
• Approach: Transform graph G=(V,E,W) into G’=(V,E’,W’)
Input PPI
graph
Transformed PPI graph
where Pi and Pj are
connected if (Pi,Pj) is a
hyperclique pattern
• Tries to meet three objectives:
– Addition of potentially biologically valid edges
– Removal of potentially noisy edges
– Assignment of weights to the resultant set of edges that indicate their reliability
Pandey et al, “Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case
Study”, ACM SIGKDD 2007, pp 540-549 (Also selected for a Highlight talk at ISMB 2008).
Results on several S. cerevisiae interaction data sets
Accuracy of top-k predictions
Combined from 3 interaction data
sets (initial edge reliabilities from
expression data).
DIPCore (Highly reliable subset
of DIP database).
Results on simulated data sets show that this transformation is able to reduce the
effect of noise on function prediction significantly.
An Association Analysis Approach for Biclustering
•
Bicluster: Group of genes that show coherent
expression across a subset of conditions in a gene
expression data set.
–
•
Hystad
et. al.
2007
Shown to be more effective than hierarchical
clustering for finding functional modules [Prelic et al,
2006]
Four general classes of biclusters studied [Madeira et al, 2004]
Constant value
biclusters
Constant row
biclusters
Constant value biclusters
Constant evolution
biclusters
Biclustering Algorithms and Associated Issues
• Several biclustering algorithms have been proposed
–
–
–
–
Cheng & Church (2000) (CC): First approach for biclustering microarray data
ISA [Bergmann et al, 2003]
SAMBA [Tanay et al, 2004], Co-clustering [Dhillon et al, 2003]
xMotifs [Murali & Kasif, 2003], OPSM [Ben-Dor et al, 2003]
• Typically these approaches work in one of the two ways:
– Start with all rows and columns and work in a top-down fashion by eliminating
rows and columns based on some greedy strategy (CC).
– Start with a random seed of rows and columns and iteratively zero in on the
final set of genes and conditions (ISA).
• Face several challenges:
– Cannot find biclusters exhaustively due to heuristic nature.
– Typically find large biclusters due to the objective function being achieved
early and can miss small interesting biclusters.
Association Analysis for Biclustering
• Association patterns are biclusters!
– Each pattern supported by a subset of
transactions
• Can formulate a framework for
directly extracting such patterns
from real-valued data (microarray).
• Advantages: Exhaustive and efficient discovery of biclusters
• Disadvantages: Need to binarize/discretize the original real-valued
data set for association mining [Becquet et al, 2002; Creighton et al, 2003;
McIntosh et al, 2007], which causes a loss of information.
Range Support Patterns (RAP)
• Need an anti-monotonic measure to compute the significance of
patterns discovered directly from real-valued data.
• We propose Range Support: Sum of RS over all conditions.
• Constraints imposed:
– Consistency of expression values
– Same direction of expression
– These conditions satisfied over substantial number of conditions
• Range support is anti-monotonic!
– Can be used within an Apriori-like framework [Agrawal et al 1993;
1994] to find patterns.
– Implementation available at http://www.cs.umn.edu/vk/gaurav/rap.
Pandey et al, “Association Analysis for Real-valued Data: Definitions and Application to Microarray Data”, TR 08-007,
Deptt of Comp Sc & Engg, University of Minnesota.
Experimental Evaluation
• Patterns derived at various thresholds from Hughes et al
(2000)’s microarray compendium.
– Compared with ISA and CC biclusters.
– RAP3: RangeSupport=8, α=0.5; RAP5: RangeSupport=12, α=1.
• Evaluation using bicluster coherence and GO annotations
– Coherence measured using Mean Squared Error (Cheng & Church,
2000).
– Functional enrichment computed for each set of biclusters using
GOTermFinder’s methodology (Boyle et al, 2004).
• Fraction of patterns enriched measured at various p-value thresholds.
• Classes separated into big (31-500 members) and small (1-30 members).
• Evaluation restricted to 100 least overlapping patterns, to
reduce the effect of redundancy (Prelic et al, 2006).
Bicluster coherence using MSE
Cheng and Church, 2000
Examples of biclusters found
CC1 bicluster with the lowest MSE
RAP3 bicluster with highest MSE
ISA2 bicluster with the lowest MSE
RAP3 bicluster with maximum number of genes
GO enrichment results
Functional enrichment for large
classes (31-500 members)
Functional enrichment for small
classes (1-30 members)
Results from randomized patterns show that the results at the strictest p-value
thresholds are the most statistically significant
GO enrichment results for several sets of small functional classes
Fraction of patterns
(biclusters) enriched by
several groups of small
classes at p-value 1x10-5
Fraction of class covered by
patterns (biclusters) among
several groups of small
classes at p-value 1x10-5
Complementarity of RAP and ISA Biclusters
• {YAR010C,YBL005W-A,YBR012WA,
YJR026W,YJR028W,YML040W,YMR046C,YMR051C}
– Enriched by RNA-mediated transposition class (GO:0032197; pvalue=4.64×10−12).
– Not found or part of a large (138) ISA bicluster!
• {YDR158W,YER052C,YOR130C,YPR145W}
– Enriched by small classes GO:0009088 (6 members), GO:0009090
(3 members) and GO:0009092 (7 members).
– Embedded within large ISA biclusters of sizes 211 and 104.
– None of these classes were found to enrich these large biclusters!
• RAP patterns are complementary to ISA biclusters in terms
of finding much smaller functionally enriched biclusters.
Summary
• RAP patterns are coherent, as measured by their
MSE scores, among different sets of biclusters.
• RAP patterns are significantly more enriched by
small and specific functional classes as compared
to ISA and CC biclusters.
• Gene groups discovered by RAP are enriched for
functions not covered by biclusters discovered by
standard biclustering algorithms.
Discriminative Pattern Mining for Biomarker Discovery
• Motivation:
– Genes (or SNPs) may NOT be predictive individually, but
as a group they could be.
Initially proposed by
[Dong et al. 1999] and
[Bay and Pazzani,
2001]
• Goal:
– Searching for group of genes that collectively show differential expression in
a binarized case-control gene expression data set.
– The absolute difference of the relative supports of itemset α in A and B is
defined as
Previous Work on Discriminative Pattern Mining
• Two-step approaches
– Find all frequent patterns and then use class labels to select
discriminative patterns. [Wang and Karypis 2005], [Cong et al 2005], [Deshpande
et al. 2005], [Cheng et al. 2007], etc.
– Disadvantage: Not suitable for dense and high dimensional data
• Approaches that make limited use of class labels
during pattern mining
– Loose upper bounds of measures of discriminative power are
derived, e.g. DiffSup, Information Gain, etc. (Emerging Patterns
[Dong et al. 1999], [Bay, Pazzani 2001] (CSET), [Cheng et al
2008], [Fan et al. 2008] etc.)
– Disadvantage: More scalable than 2-step approaches, but still
not suitable for dense and high dimensional data, especially for
discovering low-support patterns.
Our Focus
• Dense & high dimensional data, such as SNP and gene
expression data.
• Discover interesting patterns with relatively low support
that can not be discovered by traditional approaches.
Traditional approaches (2-step & CSET)
Our focus: Scalable
discovery of low
support patterns
SupMaxPair: a New Measure for Discriminative Pattern Mining
• Make use of the class labels more effectively
Interesting lowsupport patterns
that may NOT be
discovered by
existing algorithms
High support
patterns that can be
discovered by most
algorithms
Uninteresting
patterns that can
NOT be pruned by
existing algorithms
Low support
patterns that can be
pruned by most
algorithms
Fang et al., Discriminative Pattern Mining from Dense and High Dimensional Data, TR 09-011, CS
Department, Univ of Minnesota, 2009. Software available at http://vk.cs.umn.edu/SMK/
Gene Expression Dataset
• 25000 genes in 295 breast cancer patients
– 78 patients developed metastasis within 5 years of surgery.
– 217 patients did not develop metastasis within 5 years of surgery
• 5981 genes are selected using the pre-processing
methodologies suggested by [Vijver et al. NEJM 2001]
• Binarization
– Each column pertaining to the expression of a single gene is split
into two binary columns (+0.2, -0.2)
– 11962 binary columns
Patterns Discovered by CSET and SupMaxPair
• More low-support and interesting patterns are discovered
by SupMaxPair beyond CSET
Next, we will demonstrate these are statistically and biologically
significant
Permutation Test
Statistically
significant
• Part (a): 1000 highest p-value pattern generated from 1000 sets
of randomized labels.
• Part (b): Top-300 p-value patterns discovered by SupMaxPair
using the true class label.
Validation using Known Cancer Genes
• A list of ~2400 human genes (Higgins et al, 2007) known to be
involved in the induction, progression and suppression of various
types of cancers
– 611 genes overlap with the dataset used in the study
• For each pattern, a precision is computed w.r.t. the 611 cancer genes
28 size-3 patterns have 100%
precision (all 3 are cancer
genes), and they are only
discovered by SupMaxPair
but not CSET.
E.g., all genes in the pattern
{BIRC5, LAD1, TK1} are
known cancer genes.
Biologically significant
Validation using Known Cancer Genes
• For the genes covered by a set of patterns, a global precision and
recall measure can be computed w.r.t the 611 cancer genes.
CSET
Varying
p-value
cutoffs
SupMaxPair
SupMaxPair can recall
additional biologically
significant cancer genes
beyond CSET, by
dicovering low-support
patterns
Pattern-based Classification
• CSET Patterns vs. (CSET + SupMaxPair) Patterns
– Patterns used as features for classification using SVM
– Data set split 80-20 for training-testing (500 iterations)
SupMaxPair can
complement
traditional algorithms
(e.g. CSET), for better
classification.
Summary
• On dense and high dimensional data:
– SupMaxPair can complement traditional approaches by
discovering statistically and biologically significant
discriminative patterns with relatively low support.
• Can be useful for discovering gene group
biomarkers for diseases.
Fang et al., Discriminative Pattern Mining from Dense and High Dimensional Data, TR 09-011, CS
Department, Univ of Minnesota, 2009. Software available at http://vk.cs.umn.edu/SMK/
Thank you!
• Group Members:
www.cs.umn.edu/~kumar/dmbio
Gaurav
Pandey
Gowtham
Atluri
Gang
Fang
Vipin Kumar
University of Minnesota
[email protected]
www.cs.umn.edu/~kumar
Rohit
Gupta
Michael
Steinbach