Mining Maximal Dense Unit

Download Report

Transcript Mining Maximal Dense Unit

國立雲林科技大學
National Yunlin University of Science and Technology
Reducing Redundancy in
Subspace Clustering
Yi-Hong Chu, Ying-Ju Chen, De-Nian Yang, and
Ming-Syan Chen
IEEE TKDE, 2009
Reported by Wen-Chung Liao, 2009/11/10
1
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outlines









2
Motivation
Objective
Preliminary
Mining Maximal Dense Unit
Discovery of nonextensible dense unit
The NORSC Algorithm
Experiments
Conclusions
Comments
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation

The “information overlappingdata coverage” dilemma.
─
─
─
─
3
grid-based subspace clustering
algorithms
the nature of the monotonicity
property will generate redundant
clustering results.
the information overlapping
problem
the data coverage problem
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective

Propose “NOnRedundant Subspace Cluster
mining” (NORSC)
─
─
4
to automatically trim off the redundant clusters
to discover a succinct collection of subspace clusters
while also maintaining the required degree of data
coverage.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Preliminary







5
divide each attribute into
δ equal-length intervals.
k-dim grid or unit
τis called the density threshold
dense unit
a cluster in a subspace is formed by a set of connected
dense units
Let a subspace cluster C be denoted as a tuple
<SC, D(C), U(C)>
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Preliminary
τ=0.3
EC(C1) = {C2 , C3}
|D(C2) ∪D(C3)| / |D(C1)| = 7/8
If α= 0.8, C1 is redundant.
D(A1,3)= {p1,p2,…, p8}
6
Intelligent Database Systems Lab
Mining Maximal Dense Unit
7
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Mining Maximal Dense Unit
8
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Discovery of nonextensible dense
unit
9
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Discovery of nonextensible dense
unit
10
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
THE NORSC
ALGORITHM
N.Y.U.S.T.
I. M.
Fk={k-dim dense units }
SCk={k-dim subspace
clusters }
= {u | u is dense,
dim(u)=k, p in DNE(u)}
= {u | u is dense,
dim(u)=k, DNE(u) ≠ψ}
11
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
EXPERIMENTS
Real data sets
Traditional grid-based subspace clustering (CLIQUE)
Data Coverage Problem (by extended CLIQUE)
coverage loss ratio of C
12
Intelligent Database Systems Lab
EXPERIMENTS
N.Y.U.S.T.
I. M.
Cluster ratio
= #Clusters(NORSC) / #Cluster(CLIQUE)
NORSC, Accuracy on Real Data
=12 / 75 , α=0.95
NORSC, Accuracy on Synthetic Data
13
Intelligent Database Systems Lab
EXPERIMENTS
Unit Count Computation
DUC_Reduction= N.Y.U.S.T.
(the reduced amountI. M.
of dense units)
(the total number of
dense units)
UC_Ratio=
units(unit count in
NORSC)
units(unit count in
CLIQUE)
Time/Space Efficiency
14
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
EXPERIMENTS
Redundancy Threshold
15
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Conclusion

16
The NORSC algorithm to automatically
discover a succinct collection of subspace
clusters while also maintaining the required
degree of data coverage.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Comments

Advantage
─


Shortage
─ …
Applications
─
─
─
17
leverages the maximal dense units to generate
nonredundant clusters
gene expression data analysis,
E-commerce,
DNA microarray analysis,
Intelligent Database Systems Lab