yanting-talk

Download Report

Transcript yanting-talk

Math 3346 Data Mining
Presentation
Yanting Ji 4465439
2009-10-29
Today I will talk about the paper:
An Association Analysis Approach to
Biclustering
By
Gaurav Pandey, Gowtham Atluri, Michael Steinbach,
Chad L. Myers and Vipin Kumar
Department of Computer Science & Engineering,
University of Minnesota, Minneapolis, USA
Tour of the talk
•
•
•
•
•
Background
Range Support Patterns
Experimental Results
Summary
Comments
Background
Data sets: represented by a matrix M, where its
entries, Mij , is the value of item j in transaction i.
(in this paper, microarray data set is used.)
Bicluster: groups of items that show coherent
values across a subset of transactions or example.
Note: in case of clustering, we always use M rather
than a sub-matrix of M.
Background
Importance of the biclustering :
• discover
transcription
modules
from
microarray data, which denote groups of
genes that show coherent activity only across
a subset of all the conditions constituting the
data set.
• reveal important information about the
regulatory mechanisms operating in a cell.
Background
Types of the biclusters
1.Constant value biclusters
Background
2. Constant row biclusters
Background
3. Coherent value biclusters (additive model)
Background
4. Coherent evolution biclusters
Background
Recently used Algorithm :
1. ISA
2. SAMBA
3. CC
4. xMotifs
5. CTCWC
6. OPSM
7. LCD
8. Co-clustering techniques
Now we turn to the Range Support Patterns (RAP)
Overview of CC algorithm
• History:
In 2000 Cheng and Church proposed a greedy
heuristic algorithm to find biclusters in microarray
data set, which consists of a set of genes having
coherent expression values across a set of conditions.
• The goal of this algorithm is to attain the lowest
mean squared error (MSE) score by the following
formula:
Overview of ISA algorithm
ISA is short for Iterative Signature Algorithm.
• This algorithm aims to find biclusters that consist of genes
that show significant expressionindividually, and also a high
degree of co-expression with each other over a group of
conditions.
Three steps :
1. In the first step, a group of genes G0 is chosen randomly, and a gene
score of 1 is assigned to each of them. The condition scores for all
the conditions are computed over these genes, and the conditions
whose absolute score is greater than a user specified threshold tc
are selected as C0.
2. In the second step, the gene scores for all genes are computed over
these selected conditions and the genes with gene scores greater
than a user specified threshold tg are selected as G1.
3. Above two steps are repeated until the algorithm converges to a
group of genes Gn.
Range Support Patterns
The Range Support Patterns consists of two
parts:
1.The range support measure.
2.The algorithm.
Range Support Patterns
Firstly, a support measure for real-valued data is
defined.
Formally, given a data set D consisting of a set of
transactions T, which contains a value V(t,a) for
each item a in each transaction t, and a range
threshold d, the RangeSupport of a real-valued
itemset I = {i1, i2, . . . , ik} is defined as
RangeSupport(I) =  RS (t, I )
tT
where RS(t, I) is defined as
Range Support Patterns
Range Support Patterns
• What’s more, we note that the RangeSupport
measure is anti-monotonic. (proof omit)
• Now we know that the RangeSupport
measure emphasizes on two characteristics of
real-valued data:
1.Range
2.Sign of values of the itemsets in a patterns.
Range Support Patterns
Secondly, we turn to the Algorithm for finding range
support patterns from real-valued data.
• Since we defined the RangeSupport measure for
real-valued data, that tries to ensure the
coherence and sign of values in a group of items
in a pattern, while maintaining the antimonotonicity property, an Apriori-like algorithm
is easily employed for finding range support
patterns from a data set.
Experiment
• It is time to evaluate the efficacy of range support pattern
mining technique for finding coherent gene groups from
microarray data, while results with those obtained from a
similar analysis the CC and ISA biclusters are in the control
groups. (Why choose the CC and ISA biclusters?)
• Two major methodologies:
1. Evaluation using an objective measure of coherence,
the mean square error (MSE) of the values in a
bicluster.
2. Evaluation of biclusters in terms of functional
coherence, i.e functional enrichment of patterns.
Experimental Results
• The table below shows the details about the
biclusters/patterns discovered using the RAP,
CC and ISA algorithm by using different
parameter settings. The size range and
coverage numbers are computed only for the
finally selected non-overlapping patterns.
Experimental Results
Table :
Experimental Results
Observations from the table:
1. it can be seen that the biclusters produced by ISA and CC
generally contain larger number of genes than those found by
RAP.
2. CC and ISA biclusters generally cover many more genes than
RAP patterns.
3. the run time of RAP as compared to other biclustering
algorithms, which is comparable to the ISA runs, but much
faster than the CC.(This result actually is not that important)
Note: only RAP3,RAP5,CC1,ISA2,ISA4 and ISA6 are
chosen to perform later evaluation. Why?
Experimental Results
First evaluation: (MSE)
• Measure the coherence of each bicluster using
the MSE score defined in The CC algrithm.
• Analyze the distribution of these scores for all the
sets of biclusters discovered by the different
algorithms applied.
Result is shown in the graph as follow:
Experimental Results
Experimental Results
Some facts in the graph above:
• Since the closer the distribution of scores for a set of
biclusters is to zero, the closer they are expected to
capture the constant row model
• The scores for the range support patterns in RAP3 and
RAP5 are almost all zero, with very few outliers. On the
other hand, CC1 patterns have a much wider variability
of these scores.
• Hence we can conclude quantitatively that RAP is more
likely to capture the constant row model than the CC.
Experimental Results
The graphs below are sub-matrices of the data
set corresponding to the biclusters from CC1,
ISA2, RAP3.
Experimental Results
Observation and Conclusion:
• Although RAP3 has the highest MSE, it
corresponding graph shows more coherence
in its row than low MSE CC1 and ISA2’s graphs.
• These three graphs qualitatively demonstrate
the better ability of RAP pattern to find
accurate type2 biclusters (constant row
biclusters).
Experimental Results
Second evaluation: functional enrichment
of patterns
• Methodology: determine what fraction of the
patterns have a p−value smaller than a
specified threshold for at least one of the
functional classes in the consideration set.
• Criterion: the lower this p−value, the more
functionally enriched this gene group is with
this class.
Experimental Results
Since the size of groups can impact heavily on
the P-value, all groups are divided into two
categories: small classes with 1-30 elements,
and large classes with 31-500 elements.
Results are as follow:
Experimental Results
Experimental Results
• By doing this evaluation, we can find that
The RAP are quite useful discovering
patterns
that
represent
smaller
functional classes rather than the larger
biclusters.
Summary
• An efficient framework RAP for directly mining
association patterns from real-valued data
sets is setup.
• This algorithm is based on the novel antimonotonic range-support measure.
• Through comparison, RAP is quite useful for
smaller functional classes and quite accurate
for capturing the constant row model.
Comments
• Only test the microarray data set. How about other data sets?
• We know that a Apriori algorithm is computationally expensive, but
the time cost of Apriori-like algorithm for RAP shown in the paper is
quite good. It is better to demostrate details about how to modify
the Apriori algorithm in this case.
• In the ISA case, the initial value of each three sets of ISA chosen is
500, and it also says that the greater initial value is, the better result
will be. I think some large initial values other than 500 can be used.
• Comparison between RAP and CC in first evaluation is good, but the
comparison between RAP and ISA seems to be meaningless.
• As we can see, in the RAP case, the size of pattern is quite small.
This implies that if there exist some large size of pattern, this
method will be invalid.
Question and comments
Apriori Algorithm (From lecture notes)
• Basic principle: Any subset of a frequent itemset
must be frequent
• Find the frequent itemsets: the sets of items that
have minimum support
• A subset of a frequent itemset must also be a
frequent itemset.
• If AB is a frequent itemset, both A and B should be a
frequent itemsets.
• Iteratively find frequent itemsets with cardinality
from 1 to k.
• Use the frequent itemsets to generate association
rules.
Apriori Algorithm
•
•
•
•
Ck : Candidate itemset of size k
Lk : Frequent itemset of size k
L1 ={frequent items}
For (k = 2; Lk 6= 0; k + +)
Ck = candidates generated from Lk−1
• For each transaction t belongs to D
increment count of candidates in Ck contained
in t
Lk = candidates in Ck with at least min support.
P-value
• p-value is the probability of obtaining a test
statistic at least as extreme as the one that
was actually observed, assuming that the null
hypothesis is true. The fact that p-values are
based on this assumption is crucial to their
correct interpretation.
• The lower the p-value, the less likely the result.
The End