Turing Clusters into Patterns: Rectangle
Download
Report
Transcript Turing Clusters into Patterns: Rectangle
Turing Clusters into Patterns:
Rectangle-based Discriminative
Data Description
Byron J. Gao and Martin Ester
IEEE ICDM 2006
Adviser: Koh Jia-Ling
Speaker: Liu Yu-Jiun
Date: 2006/11/8
Introduction
The goal of data mining is to discover
useful knowledge.
Present the clusters as the sets of points.
Interpret the clusters as the humancomprehensible patterns.
In the past, only concern the length of patterns,
and descript the cluster C directly.
2
SOR description
Sum of Rectangles ( SOR ) is the canonical
format for cluster descriptions.
SOR for C : either SOR or SOR
Black: cluster C (R1 and R2)
Red: other cluster (R1’)
Green: Bc
SOR description: R1 + R2
SOR description: Bc – R1’
SOR SOR kSOR
3
ESOR (C ) R1 R2
Notations
4
Example
R2
R3
R4
E SOR (C ) R1 R 2 R3 R4 R5
E SOR (C ) Bc ( R1' R2' R3' )
R5
R2’
R1
R3’
5
Problems
Maximum Description Accuracy (MDA)
Minimum Description Length (MDL)
A novel description: kSOR description
6
Accuracy Formula
recall E C / C
precision E C / E
f
2 recall precision
recall precision
Two additional measures:
1. Recall at fixed precision. (fix precision = 1)
2. Precision at fixed recall. (fix recall = 1)
7
Three Heuristic Algorithms
Learn2Cover MDL approximating max length.
Length of rectangle.
DesTree MDA approximating the Pareto front.
FindClans transforms the output from DesTree
into the shorter final description.
8
Learn2Cover
o x is the next point from Bc
in the sorted order.
9
Cost of Learn2Cover
l j (R ) : the length of rectangle R along dimension Dj.
R’
: the expanded R in covering
ox
10
DesTree
DesTree takes the output from
Learn2Cover, R or R -, as input.
Build the tree from bottom to up.
Merge the child nodes into parent
nodes until a single node is left.
Each node represents a rectangle.
The higher in the tree we cut, the
shorter the length and the lower the
accuracy.
11
merge
12
FindClans
FindClans takes as input a cut from
DesTree, outputs a kSOR description.
13
Algorithm -- FindClans
14
Experimental
Compare with CART and BP.
Real datasets from the UCI repository,
where data records with the same
class label were treated as a cluster.
15
Comparisons with CART
Concern both of MDA and MDL.
16
DesTree vs. CART
accuracy
length
17
Comparisons with BP
BP addresses the MDL problem only.
Synthetic datasets.
Gaining 20%~50% length reduction.
Learn2Cover without violation checking, so
faster than BP.
18
Conclusions
kSOR provides enhanced expressive
power.
MDA allows trading accuracy for
interpretability.
A paradigm for query-based “secondgeneration” database mining systems.
19