Turing Clusters into Patterns: Rectangle

Download Report

Transcript Turing Clusters into Patterns: Rectangle

Turing Clusters into Patterns:
Rectangle-based Discriminative
Data Description
Byron J. Gao and Martin Ester
IEEE ICDM 2006
Adviser: Koh Jia-Ling
Speaker: Liu Yu-Jiun
Date: 2006/11/8
Introduction
 The goal of data mining is to discover
useful knowledge.
 Present the clusters as the sets of points.
 Interpret the clusters as the humancomprehensible patterns.
 In the past, only concern the length of patterns,
and descript the cluster C directly.
2
SOR description
 Sum of Rectangles ( SOR ) is the canonical
format for cluster descriptions.
 SOR  for C : either SOR or SOR 
Black: cluster C (R1 and R2)
Red: other cluster (R1’)
Green: Bc
SOR description: R1 + R2
SOR  description: Bc – R1’
SOR  SOR   kSOR 
3
ESOR (C )  R1  R2
Notations
4
Example
R2
R3
R4
E SOR (C )  R1  R 2  R3  R4  R5
E SOR (C )  Bc  ( R1' R2' R3' )
R5
R2’
R1
R3’
5
Problems
 Maximum Description Accuracy (MDA)
 Minimum Description Length (MDL)
 A novel description: kSOR  description
6
Accuracy Formula
 recall  E  C / C
 precision  E  C / E
 f 
2  recall  precision
recall  precision
Two additional measures:
1. Recall at fixed precision. (fix precision = 1)
2. Precision at fixed recall. (fix recall = 1)
7
Three Heuristic Algorithms
 Learn2Cover  MDL  approximating max length.
 Length of rectangle.
 DesTree  MDA  approximating the Pareto front.
 FindClans  transforms the output from DesTree
into the shorter final description.
8
Learn2Cover
o x is the next point from Bc
in the sorted order.
9
Cost of Learn2Cover
l j (R ) : the length of rectangle R along dimension Dj.
R’
: the expanded R in covering
ox
10
DesTree
 DesTree takes the output from
Learn2Cover, R or R -, as input.
 Build the tree from bottom to up.
 Merge the child nodes into parent
nodes until a single node is left.
 Each node represents a rectangle.
 The higher in the tree we cut, the
shorter the length and the lower the
accuracy.
11
merge
12
FindClans
 FindClans takes as input a cut from

DesTree, outputs a kSOR description.
13
Algorithm -- FindClans
14
Experimental
 Compare with CART and BP.
 Real datasets from the UCI repository,
where data records with the same
class label were treated as a cluster.
15
Comparisons with CART
 Concern both of MDA and MDL.
16
DesTree vs. CART
accuracy
length
17
Comparisons with BP




BP addresses the MDL problem only.
Synthetic datasets.
Gaining 20%~50% length reduction.
Learn2Cover without violation checking, so
faster than BP.
18
Conclusions
 kSOR  provides enhanced expressive
power.
 MDA allows trading accuracy for
interpretability.
 A paradigm for query-based “secondgeneration” database mining systems.
19