Automatic Subspace Clustering Of High
Dimensional Data For Data Mining Application
Authors: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan
Schedule
• (1). Introduction
• (2). Motivation
• (3). Contributions of the paper
• (4). Subspace clustering
• (5). CLIQUE (Clustering In QUEst)
• (6). Performance experiments
• (7). Conclusions
Introduction
• (1). Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on their attributes (dimensions).
• (2). Clustering techniques have been studied in statistics (multivariate analysis: cluster analysis), machine learning, and databases (CLARANS, BIRCH, DBSCAN).
• (3). Clustering techniques: (a) partitional clustering: the K-means method, CLARANS, BIRCH, DBSCAN; (b) hierarchical clustering.
Motivation
• The need for developing new algorithms:
• (1). Effective treatment of high dimensionality: to effectively extract information from the huge amounts of data stored in databases, the running time of an algorithm must be predictable and usable on large databases.
• (2). Interpretability of results: users expect clustering results on high dimensional data to be interpretable and comprehensible.
• (3). Scalability and usability: many clustering algorithms do not perform well on large databases, which may contain millions of objects, and clustering on a sample of a given data set may lead to biased results. In other words, the clustering technique should be fast, should scale with the number of dimensions and the size of the input, and should be insensitive to the order of the input data.
Contributions of the paper
• (1). CLIQUE satisfies the above desiderata (effectiveness, interpretability, scalability, and usability).
• (2). CLIQUE automatically finds subspaces with high-density clusters.
• (3). CLIQUE generates a minimal description for each cluster as a DNF expression.
• (4). Empirical evaluation shows that CLIQUE scales linearly with the number of input records and has good scalability as the number of dimensions in the data or the dimensionality of the hidden clusters grows.
Subspace Clustering
• What is (a) a unit, (b) a dense unit, (c) a cluster, (d) a minimal description of a cluster?
• In Figure 1, the two-dimensional space (age, salary) has been partitioned by a 10x10 grid (ξ = 10).
• The unit u = (30 ≤ age < 35) ∧ (1 ≤ salary < 2).
• A and B are both regions:
• A = (30 ≤ age < 50) ∧ (4 ≤ salary < 8)
• B = (40 ≤ age < 60) ∧ (2 ≤ salary < 6)
• Assuming the dense units have been shaded, A ∪ B is a cluster (A and B are connected regions), while A ∩ B is not a maximal region.
• The minimal description for this cluster A ∪ B is the DNF expression ((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6)), which can be read as a predicate as sketched below.
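As a small illustration (not from the paper; the test points are made up), the DNF description above is simply a disjunction of the two region predicates over (age, salary):

```python
# A small sketch: the minimal DNF description of the cluster A ∪ B from
# Figure 1, evaluated as a predicate over (age, salary).
def in_cluster(age, salary):
    in_a = 30 <= age < 50 and 4 <= salary < 8    # region A
    in_b = 40 <= age < 60 and 2 <= salary < 6    # region B
    return in_a or in_b                          # DNF: A ∨ B

print(in_cluster(45, 5))   # True: the point lies in both A and B
print(in_cluster(32, 3))   # False: outside both regions
```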
• In Figure 2, assume τ = 20% (the density threshold).
• If selectivity(u) > τ, then u is a dense unit, where selectivity is the fraction of the total data points contained in the unit (see the sketch below).
• No 2-dimensional unit is dense, and there are no clusters in the original data space.
• But if the points are projected onto the salary dimension, there are three 1-dimensional dense units, and there are two clusters in the 1-dimensional salary subspace: C′ = (5 ≤ salary < 7) and D′ = (2 ≤ salary < 3).
• There is no dense unit and no cluster in the 1-dimensional age subspace.
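The dense-unit test can be sketched in a few lines of Python. This is only an assumed, simplified rendering of the definition: the function name dense_units_1d and the salary values are made up for illustration; only the grid parameter ξ and the threshold τ come from the slides.

```python
from collections import Counter

def dense_units_1d(values, lo, hi, xi, tau):
    """Partition [lo, hi) into xi equal intervals; return the dense interval indices."""
    width = (hi - lo) / xi
    counts = Counter(int((v - lo) // width) for v in values if lo <= v < hi)
    n = len(values)
    # selectivity(u) = fraction of all data points contained in unit u
    return {u for u, c in counts.items() if c / n > tau}

# made-up salary values; with xi = 10 and tau = 20% only intervals 2 and 5 are dense
salaries = [2.1, 2.4, 2.7, 5.2, 5.5, 5.8, 6.1, 6.4, 9.9, 0.3]
print(dense_units_1d(salaries, lo=0, hi=10, xi=10, tau=0.2))   # {2, 5}
```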
CLIQUE Algorithms
• CLIQUE consists of the following three steps:
• (1). Identification of the subspaces that contain clusters.
• (2). Identification of the clusters.
• (3). Generation of minimal descriptions for the clusters.
(1). Identification of subspaces that contain clusters
• The difficulty in identifying subspaces that contain clusters lies in finding dense units in different subspaces.
• (1). We use a bottom-up algorithm to find dense units; it exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space.
• Lemma 1 (monotonicity): if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
• The bottom-up algorithm proceeds as follows:
• (a). Determine the 1-dimensional dense units and self-join them to get the candidate 2-dimensional units. In general, once the (k-1)-dimensional dense units D_(k-1) are known, we self-join D_(k-1) to get the candidate k-dimensional units C_k.
• (b). We discard those units from C_k which have a (k-1)-dimensional projection that is not included in D_(k-1).
• (2). The bottom-up algorithm can be made faster with MDL-based pruning. A sketch of the candidate generation and pruning step is given below.
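A rough sketch of candidate generation and pruning follows. This is not the authors' code: the representation of a unit as a frozenset of (dimension, interval-index) pairs, the function name candidate_units, and the example units are assumptions, and the self-join is written as a brute-force pass rather than an ordered Apriori-style join.

```python
from itertools import combinations

def candidate_units(dense_prev):
    """Self-join the (k-1)-dim dense units, then prune via the monotonicity lemma."""
    cands = set()
    for a in dense_prev:
        for b in dense_prev:
            u = a | b
            # keep pairs that differ in exactly one dimension and stay consistent
            if len(u) == len(a) + 1 and len({d for d, _ in u}) == len(u):
                cands.add(u)
    # prune: every (k-1)-dim projection of a candidate must itself be dense (Lemma 1)
    return {u for u in cands
            if all(frozenset(p) in dense_prev for p in combinations(u, len(u) - 1))}

# 1-dim dense units, each a frozenset of (dimension, interval-index) pairs (made up)
d1 = {frozenset({("age", 6)}), frozenset({("salary", 2)}), frozenset({("salary", 5)})}
print(candidate_units(d1))   # the two candidate 2-dim units combining age with salary
```

Only candidates all of whose lower-dimensional projections are dense survive, which is exactly the pruning justified by Lemma 1.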
(2).Finding Clusters
• The input to this step of CLIQUE is a set of dense units D, all in the same k-dimensional space. The output is a partition of D into D1, D2, ..., Dq such that all units in Di are connected and no two units Ui ∈ Di, Uj ∈ Dj with i ≠ j are connected.
• Depth-first search algorithm:
• We use a depth-first search algorithm to find the connected components of the graph. Starting with some unit U in D, we assign it the first cluster number and find all the units it is connected to; then, if there are still units in D that have not yet been visited, we pick one and repeat the procedure. A sketch is given below.
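A minimal sketch of this search, assuming each dense unit is encoded as a tuple of interval indices and that two units are connected when they share a common face (their indices differ by one in exactly one dimension):

```python
def find_clusters(dense_units):
    """Partition dense units (tuples of interval indices) into connected components."""
    dense = set(dense_units)
    cluster_of, current = {}, 0
    for start in dense:
        if start in cluster_of:
            continue
        current += 1
        stack = [start]                       # depth-first search from an unvisited unit
        while stack:
            u = stack.pop()
            if u in cluster_of:
                continue
            cluster_of[u] = current
            for d in range(len(u)):           # neighbours sharing a common face
                for step in (-1, 1):
                    v = u[:d] + (u[d] + step,) + u[d + 1:]
                    if v in dense and v not in cluster_of:
                        stack.append(v)
    return cluster_of

units = [(3, 4), (3, 5), (4, 5), (8, 1)]      # made-up 2-dim dense units
print(find_clusters(units))                   # two components: {(3,4),(3,5),(4,5)} and {(8,1)}
```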
(3).Generating minimal cluster
descriptions
• The input to this step consists of the disjoint clusters in a k-dimensional subspace. The goal is to generate a minimal description of each cluster, in two steps:
• (1). Covering with maximal regions.
• (2). Minimal cover.
• (1). Covering with maximal regions
• We begin with an arbitrary dense unit U1 ∈ C and greedily grow a maximal region R1 that covers U1; we add R1 to R. Then we find another unit U2 ∈ C that is not yet covered by any maximal region in R and greedily grow a maximal region R2 that covers U2. We repeat this procedure until all units in C are covered by some maximal region in R.
• How to obtain a maximal region covering a dense unit U: one way of greedily growing such a region is sketched below.
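The following is only an illustrative sketch, not the paper's code, and its assumptions are made up: a region is stored as a list of (lo, hi) interval-index ranges, dense units are index tuples, and the region is extended one dimension at a time, in both directions, as long as every unit it covers remains dense.

```python
from itertools import product

def covered_units(region):
    """All grid units (index tuples) inside a region given as (lo, hi) index ranges."""
    return set(product(*(range(lo, hi) for lo, hi in region)))

def grow_region(unit, dense):
    """Greedily grow a region around a dense unit, one dimension at a time."""
    region = [(i, i + 1) for i in unit]            # start from the single unit
    for d in range(len(unit)):
        for direction in (1, -1):
            while True:
                lo, hi = region[d]
                trial = list(region)
                trial[d] = (lo, hi + 1) if direction == 1 else (lo - 1, hi)
                if covered_units(trial) <= dense:  # still entirely dense?
                    region = trial
                else:
                    break
    return region

dense = {(3, 4), (3, 5), (4, 4), (4, 5), (5, 5)}   # made-up dense 2-dim units
print(grow_region((3, 4), dense))                  # [(3, 5), (4, 6)]
```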
(2). Minimal Cover
• This last step of CLIQUE takes as input a cover for each cluster and finds a minimal cover.
• Minimality is defined in terms of the number of maximal regions required to cover the cluster.
• Removal heuristic:
• (1). Remove from the cover the smallest (in number of units) maximal region which is redundant (i.e., every unit of it is also contained in some other maximal region).
• (2). Break ties arbitrarily.
• (3). Repeat the procedure until no maximal region can be removed. A sketch is given below.
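A minimal sketch of the removal heuristic, under the assumption that each maximal region has already been expanded into the set of grid units it covers (the regions r1, r2, r3 are made up):

```python
def minimal_cover(regions):
    """Repeatedly drop the smallest redundant maximal region from the cover."""
    cover = list(regions)
    while True:
        redundant = [r for r in cover
                     if r <= set().union(*(s for s in cover if s is not r))]
        if not redundant:
            return cover
        # remove the smallest (fewest units) redundant region; ties broken arbitrarily
        cover.remove(min(redundant, key=len))

r1 = {(3, 4), (3, 5), (4, 4), (4, 5)}       # made-up maximal regions as unit sets
r2 = {(4, 4), (4, 5), (5, 4), (5, 5)}
r3 = {(4, 4), (4, 5)}                        # contained in r1 ∪ r2, hence redundant
print(minimal_cover([r1, r2, r3]))           # keeps r1 and r2 only
```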
Performance Experiments
• We now empirically evaluate CLIQUE using synthetic data (generator from M. Zait and H. Messatfa, "A comparative study of clustering methods").
• The goals of the experiments are to assess the efficiency and accuracy of CLIQUE.
• Efficiency: determine how the running time scales with
•   - the dimensionality of the data space,
•   - the dimensionality of the clusters,
•   - the size of the data.
• Accuracy: test whether CLIQUE recovers known clusters in some subspaces of a high dimensional data space.
Synthetic data results
Accuracy: Comparisons with BIRCH, DBSCAN and SVD
We used clusters embedded in 5-dimensional subspaces while varying the dimensionality of the space from 5 to 50.
CLIQUE was able to recover all clusters in every case.
Real data results
Conclusions
• (1). The problem of high dimensionality is often tackled by requiring the user to specify the subspace for cluster analysis, but user identification of subspaces is quite error-prone. CLIQUE can find clusters embedded in subspaces of high dimensional data without requiring the user to guess which subspaces might have interesting clusters.
• (2). CLIQUE generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension.
• (3). CLIQUE is insensitive to the order of input records. Some clustering algorithms are sensitive to the order of input data; for example, the same set of data, input in different orderings, may generate dramatically different clusters.
• (4). CLIQUE does not presume any canonical data distribution, which is very useful in applications.
• (5). Empirical evaluation shows that CLIQUE scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
• (6). CLIQUE can accurately discover clusters embedded in lower dimensional subspaces.