Transcript Algorithm

國立雲林科技大學
National Yunlin University of Science and Technology
Entropy-based Subspace Clustering
for Mining Numerical Data
Advisor :Dr. Hsu
Graduate: Yu Cheng Chen
Author: Chung-hung Cheng, Ada Wai-chee Fu ,
Yi Zhang
ACM 1999
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline










Motivation
Objective
Introduction
Related Work
Criteria of Subspace clustering
Entropy-based Method
Entropy vs the Clustering Criteria
Algorithm
Experiments
Conclusions
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation
Real-life databases contain many attributes.
Most traditional clustering methods are shown to
handle problem sizes of several hundreds to several
thousands transactions.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective
Propose a method that gives reasonable performance
on high dimensionality and large data sets.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction
A good clustering algorithm need:
Handle arbitrary shapes for clusters
Do not make assumptions about distribution of data
Not be sensitive to the outliers
Not require input parameters
Convey the resulting clusters to the users
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction
A solution to the above problem would consist of the
following steps:
(1) Find the subspaces with good clustering
(2) Identify the clusters in the selected subspaces.
(3) present the result to the users
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Related Work
CLIQUE is the only published algorithm that
satisfied to identify clusters embedded in
subspaces of datasets.
Two parameters, ξand τ.
Partition every dimension into ξintervals of
equal length (unit).
A unit is dense if data points contained in it is
>τ
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Related Work
Clusters are unions of connected dense units.
To reduce the search space we used a bottomup algorithm.
If a collection of points S is a cluster in k dimensional
space, then S is also part of a cluster in (k-1)
dimensional projections of space.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Related Work
Example
Intelligent Database Systems Lab
Criteria of Subspace Clustering
High coverage.
High density
Correlation of dimensions
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Entropy-based Method
Entropy is defined as following:
Calculation of Entropy
where d(x) be the density of a cell x in terms of the
percentage of data contained in x.
Intelligent Database Systems Lab
Entropy vs the Clustering Criteria
Entropy and the coverage criterion.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Entropy vs the Clustering Criteria
We want to establish the relationship that,
under certain conditions, the entropy decreases
as the coverage increases.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Entropy vs the Clustering Criteria
Entropy and the density criterion.
Assume that the density of dense units are all equal to
α, the density of non-dense units are all equal to p.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Entropy vs the Clustering Criteria
Entropy and variable correlation
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Algorithm
Overall strategy consists of three main steps:
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Algorithm
A subspace whose entropy is below w is considered to
have good clustering.
We start with finding 1 dimensional subspace with
good clustering, then we use them to generate the can
candidate 2 dimensional subspaces.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Algorithm
Dimensions Correlation
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Algorithm
We define the term interest as below:
The higher the interest, the stronger the correlation
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Algorithm

Downward closure is a pruning property.
─

If a subspace does not satisfy this property, we can
cross out all its super-spaces
Upward closure is a constructive property.
─
If a subspace satisfies the property, all its
Super-spaces also satisfy this property.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Algorithm
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments
We use data of 10 dimensions and 300,000 transaction
in the experiment.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments
Figure 10 & 11
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Conclusions
We establish some relationship between entropy and
the three criteria.
We incorporates the idea of using a pair of downward
and upward closure properties which is shown
effective in the reduction of the search space.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Personal Opinion

…
Intelligent Database Systems Lab